Soumya D. Sanyal

Jan 10, 2016

Looking at the dataset - correlations

In this post, I'm going to see whether I can uncover any relationships between the explanatory features and the explained features (expenditure).

This part of the project serves to aid in getting to actually know the data, and in hypothesis making, hypothesis refinement, and model selection; it essentially lays the foundation for the "story" that the project will tell.

I find this stage of the modeling process to be one of the most interesting. It's practically impossible to know whether your priors about the problem will influence the modeling process - I think they almost certainly will. At the same time, there's dramatic tension - will my hunch about the problem be supported by the data, or will the data show me that it's completely false? Will another insight rise to take its place? Will there be any story at all? All of these outcomes are entirely possible at this stage, and it's quite nerve-racking.

In any case, I can do my best by examining the data for as much supporting (or refuting) evidence in as objective a manner as possible.

Here are some of the questions I'm hoping to answer at the end of this post:

  1. What explanatory features are predictive (if any)?
  2. What explains the statistical variation in the dataset? What does the randomness look like? Is there a pattern? Surely it's not all just white noise!
  3. Which features are mutually correlated, and which are mutually uncorrelated? Do any features combine to create stronger relationships?

By this point, I've accumulated a lot of boilerplate code that I need to load to start up my modeling session. Rather than subject you to a lot of scrolling, I'm simply going to save it to a Python module called blogging.py and import it in one step. If you'd like to take a look at the environment I'm importing to build the analysis in this post, it's here.

In [1]:
from blogging import *

The next step in my analysis is to explore the data a bit more. I want to see if I can get any more insight into what the underlying data generating process is; the more I can rely on a clean look at the data, the better I'll be able to dispel my own incorrect assumptions about what is or should be happening.

One problem I do have to contend with in working with this dataset is that it comes as weighted survey data. Working with weighted data can be quite tricky. In the very first instance, since the unweighted data does not (by definition) accurately represent the population of interest, it's impossible to draw valid inferences from the standard exploratory (or even regression) procedure without accounting for the weights.

When constructing summary data, as in a histogram, or constructing a linear function of the data, this is not much of a problem. Any linear function $f(x)$ of the data is just as easy to compute as $f(Wx)$, where $W$ is a weight matrix, and variances will be quadratic in $W$, as expected. However, there are myriad other challenges in working with weighted data; there's a relatively recent discussion of some of these challenges in this paper by Andrew Gelman.
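To make this concrete, here's a minimal numpy sketch with toy numbers (nothing here comes from the survey data): the weighted mean is a linear function of the data, and the weighted variance is quadratic in the weights, so both are straightforward to compute once the weights are carried along.

```python
import numpy as np

# Toy illustration: weighted estimates differ from unweighted ones.
# Survey weights scale each respondent up to the number of people
# they represent in the population.
x = np.array([1.0, 2.0, 3.0, 10.0])
w = np.array([4.0, 3.0, 2.0, 1.0])  # hypothetical person weights

unweighted_mean = x.mean()
weighted_mean = np.average(x, weights=w)

# Weighted variance: quadratic in the weights, as noted above.
weighted_var = np.average((x - weighted_mean) ** 2, weights=w)

print(unweighted_mean)  # 4.0
print(weighted_mean)    # 2.6
print(weighted_var)     # 6.64
```

The gap between 4.0 and 2.6 is exactly the kind of distortion that ignoring the weights would bake into every downstream summary.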

For my immediate purposes - my Insight project - the challenge was to come up with some rough and ready ways to accommodate these weights. While the histogram methods in the Python ecosystem have options for specifying item level weights, not all EDA function calls do. In particular, scatterplots don't (as far as I know). Most model estimation calls that I looked into before starting this project did have some accommodations for item weights, so I figured that if I could get past the EDA stage, I might be fine.

I needed to find a way of getting a sense of the pairwise correlations in the dataset while accounting for the weights. In the end, I found that matplotlib hexbin plots had some support for item weights, and I figured that adding the weighted sum of points to hexbin cells would do well enough without my having to write my own visualization functions. In the end, this is what I constructed:

In [2]:
def look_at_pairs(x, y, x_range, y_range, w, colorbar=False, gridsize=200, bins=100, mincount=None):
    x_min, x_max = x_range
    y_min, y_max = y_range
    plt.hexbin(x, y, C=w, gridsize=gridsize, bins=bins,
               xscale='linear', yscale='linear',
               cmap=None, norm=None, vmin=None, vmax=None,
               alpha=1, linewidths=None, edgecolors='none',
               reduce_C_function=np.sum, mincnt=mincount, marginals=False,
               )
    plt.xlim([x_min, x_max])
    plt.ylim([y_min, y_max])
    if colorbar:
        plt.colorbar()
    plt.show()

Let's try this out on an example pair of features.

In [3]:
look_at_pairs(x=data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], 
              y=data["TOTAL_OFFICE-BASED_EXP_13"], 
              x_range=[0,85], 
              y_range=[0,200000],
              w=data["FINAL_PERSON_WEIGHT_2013"],
              colorbar=True,
              gridsize=500,
              bins=85,
             )

In this graph, greater (weighted) mass in any one hexbin produces a higher color value (per the colorbar scale on the right), so more tightly packed, heavily weighted data shows up as denser regions of the plot.
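For intuition, here's a tiny sketch of the aggregation hexbin performs with C=w and reduce_C_function=np.sum, using rectangular bins via np.histogram2d rather than hexagons (toy data, not the survey): each cell accumulates the total survey weight of the points that land in it, not the raw point count.

```python
import numpy as np

# Two points land in the left cell, one in the right cell; the
# weighted "counts" are the sums of their weights, not 2 and 1.
x = np.array([0.5, 0.5, 1.5])
y = np.array([0.5, 0.5, 0.5])
w = np.array([10.0, 20.0, 5.0])  # hypothetical person weights

H, xedges, yedges = np.histogram2d(x, y, bins=[2, 1],
                                   range=[[0, 2], [0, 1]], weights=w)
print(H)  # cells hold 30.0 and 5.0: weighted mass, vs raw counts 2 and 1
```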

There looks to be a general tendency for older people to spend more on office based services in this picture, but my sense of the trend is obscured by the outliers above 100k. Of these outliers, most are in the 70-80 year range, with one below 20 years. There's not enough data for me to make a robust case about whether age correlates well with the incidence of really high expenditure, so I'm going to try looking at this again after cutting off the outliers.

In [4]:
look_at_pairs(x=data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], 
              y=data["TOTAL_OFFICE-BASED_EXP_13"], 
              x_range=[0,85], 
              y_range=[0,100000],
              w=data["FINAL_PERSON_WEIGHT_2013"],
              colorbar=True,
              gridsize=500,
              bins=85,
             )

This is helpful. It looks like a lot of weighted mass (the aqua region at the bottom) is concentrated near zero expenditure. It also highlights the fact that expenditure trends higher with age (as does the variance in expenditure).

It looks like this is a fairly useful visualization technique as far as quick and dirty goes.

Here are the other groups of correlations I want to inspect:

  1. Income
  2. Utilization
  3. Health conditions
  4. Duration without insurance
  5. Age of diagnosis
  6. BMI
  7. Demographic information
  8. Expenditures

I also want to look at the four expenditure categories at the same time. Here's a quick method to help me do that:

In [5]:
def four_correlations(x,
                      x_range,
                      w,
                      xticks=None,
                      office_range=[0,100000],
                      outpatient_range=[0,26000],
                      inpatient_range=[0,320000],
                      er_range=[0,70000],
                      colorbar=False, 
                      gridsize=200, 
                      bins=100, 
                      mincount=None):
    
    term=x.name
    plt.figure(figsize=(15,15))
    
    for pos,y in [(1,data["TOTAL_OFFICE-BASED_EXP_13"]),
                  (2,data["TOTAL_OUTPATIENT_PROVIDER_EXP_13"]),
                  (3, data["TOT_HOSP_IP_FACILITY_+_DR_EXP_13"]),
                  (4, data["TOTAL_ER_FACILITY_+_DR_EXP_13"]),
                 ]:
        plt.subplot(2,2,pos)
        plt.hexbin(x,
                   y,
                   C = w,
                   gridsize = gridsize,
                   bins = bins,
                   xscale = 'linear',
                   yscale = 'linear',
                   cmap=None,
                   norm=None,
                   vmin=None,
                   vmax=None,
                   alpha=1,
                   linewidths=None,
                   edgecolors='none',
                   reduce_C_function = np.sum,
                   mincnt=mincount,
                   marginals=False,
                   )
        plt.title(y.name)

        if pos==1:
            plt.ylim(office_range)
        if pos==2:
            plt.ylim(outpatient_range)
        if pos==3:
            plt.ylim(inpatient_range)
        if pos==4:
            plt.ylim(er_range)
            
        plt.xlim(x_range)
        plt.xlabel(term)
        
        if xticks:
            plt.xticks(xticks[0],[thing.replace("NON-HISPANIC ","")[:15] for thing in xticks[1]], rotation=40)
        if colorbar:
            plt.colorbar()
        
        
    plt.tight_layout()
    plt.show()
    
    
In [6]:
four_correlations(data["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"], x_range=[0,85], w=data["FINAL_PERSON_WEIGHT_2013"],
                 colorbar=True,)

It looks like the same pattern holds for the other expenditure categories as for office expenditure when it comes to age.

Looking at income:

In [7]:
four_correlations(data["FAMILY'S_TOTAL_INCOME"],
                  x_range=[0,500000],
                  w=data["FINAL_PERSON_WEIGHT_2013"],
                  office_range=[0,100000],
                  outpatient_range=[0,40000],
                  inpatient_range=[0,100000],
                  er_range=[0,40000],
                 colorbar=True)

Here you get the result that while the level of spending moves away from zero as income rises, the variance of expenditure in all four categories declines with income. An important question is whether the reduced variation in spending at higher incomes is simply a result of having fewer people at those income levels, or if there is a systematic effect (more effective health management, better bargaining power).
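One rough way to probe that question would be to compute the weighted variance of expenditure within income bins directly, and then compare it against the weighted population in each bin. Here's a sketch with made-up numbers - the function and all the data here are hypothetical, not from the MEPS file:

```python
import numpy as np

# Weighted variance of spending within income bins, to test whether
# the spread of expenditure really narrows at higher incomes.
def weighted_var_by_bin(income, spend, w, edges):
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (income >= lo) & (income < hi)
        if w[mask].sum() == 0:
            out.append(np.nan)  # empty bin
            continue
        m = np.average(spend[mask], weights=w[mask])
        out.append(np.average((spend[mask] - m) ** 2, weights=w[mask]))
    return np.array(out)

# Toy data: spending is far more dispersed in the low-income bin.
income = np.array([10000, 20000, 90000, 95000])
spend = np.array([0.0, 8000.0, 4000.0, 5000.0])
w = np.ones(4)
print(weighted_var_by_bin(income, spend, w, [0, 50000, 100000]))
```

If the variance still shrinks after normalizing by the weighted headcount in each bin, that would point toward a systematic effect rather than a thin-sample artifact.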

Looking at how utilization correlates with expenditure:

In [8]:
for term in utilizations:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[x_min,x_max]
    four_correlations(data[term], x_range=x_range, w=data["FINAL_PERSON_WEIGHT_2013"],
                     colorbar=True)
    print("\n\n\n")

I get the following out of these plots:

  1. Office-based utilization and expenditure are positively correlated. Office-based utilization is negatively correlated with all other categories of expenditure.
  2. Outpatient utilization doesn't strongly correlate with outpatient expenditure, and correlates somewhat negatively with the other categories of expenditure.
  3. ER visits correlate negatively with other expenditures, and hardly at all with ER expenditure itself.
  4. Inpatient visits correlate positively with inpatient expenditure, and negatively with all other expenditure.

Next, I'll look at chronic health diagnoses. There are over 40 in my list, so it'll be useful to break them into groups.

Before we take a look at these plots, I want to quickly mention a very counterintuitive finding that I came across when I looked at these. For many of these chronic conditions, the subpopulation that does not have a positive diagnosis has a much heavier right tail than the subpopulation with the chronic condition.

On the face of it, it looked like people overall were at greater risk of spending more if they did not have a chronic illness than if they did. The paradox disappears when you zoom in on these plots around the first $\$10,000$ spent; it shows that many more people without a chronic condition are clustered at $\$0$ spent, relatively speaking. It's interesting to conjecture what may cause the thinner tails for chronic sufferers - one explanation may just be that the larger subpopulation is naturally accompanied by a larger variance.
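That zero-clustering explanation can be checked directly by comparing the weighted share of $\$0$ spenders in the two subpopulations. A sketch with toy numbers - none of these values come from the actual dataset:

```python
import numpy as np

# Weighted fraction of a subpopulation that spent exactly $0.
def weighted_zero_share(spend, w):
    return w[spend == 0].sum() / w.sum()

spend_no_dx = np.array([0.0, 0.0, 0.0, 120000.0])   # no chronic diagnosis
spend_dx = np.array([0.0, 2000.0, 3000.0, 9000.0])  # diagnosed
w = np.ones(4)  # hypothetical person weights

print(weighted_zero_share(spend_no_dx, w))  # 0.75
print(weighted_zero_share(spend_dx, w))     # 0.25
```

A much larger zero share in the undiagnosed group is consistent with the apparent heavy right tail being an artifact of where the mass sits, not of greater spending risk.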

The plots below show the data at this limited scale for emphasis.

In [9]:
cancer=[term for term in health if "CANCER" in term]
lung=[term for term in health if "LUNG" in term or "BRONCH" in term or "ASTHMA" in term
     or "EMPHYSEMA" in term
     or "INH" in term]
cardio=[term for term in health if "BLOOD" in term or "CHOLESTEROL" in term or "CORON" in term or "HEART" in term
       or "STROKE" in term
       or "DIABETES" in term
       or "ANGINA" in term]
joint=[term for term in health if "JOINT" in term or "ARTHR" in term]
other=[term for term in health if term not in joint+cardio+lung+cancer]
In [10]:
for term in cancer:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,5000],
                      outpatient_range=[0,5000],
                      inpatient_range=[0,5000],
                      er_range=[0,10000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. A cancer diagnosis overall drives expenditure for office based services and inpatient services, and has less of a clear effect on outpatient spending and ER spending. For these latter two categories, there is greater variance in spending for the subpopulation without a positive cancer diagnosis.
  2. There is very little mass in the graphs measuring spending against the various subdiagnoses of cancer. These features will likely lack statistical significance in any analysis I do, and it may be best to leave them out and just use the overall cancer diagnosis feature.
In [11]:
for term in lung:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. Diagnoses of emphysema, bronchitis, and asthma drive spending on office services and inpatient services. They have less of an effect on outpatient and ER services; the right tails in these latter categories are heavier in the undiagnosed group.
  2. The other features do not seem to be as predictive of increased spending.
In [12]:
for term in cardio:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. Diagnoses of high blood pressure, high cholesterol, coronary heart disease, angina, heart attack, other heart disease, stroke, and diabetes drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
  2. A multiple diagnosis of high blood pressure is probably not very predictive.
In [13]:
for term in joint:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. Diagnoses of joint pain and arthritis drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
In [14]:
for term in other:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[-2,3]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. Diagnoses of ADD/ADHD, limitations in physical functioning, and incidence of pregnancy drive spending on office services and inpatient services, while having a limited effect on the other two categories of spending, in which they are associated with thinner right tails.
In [15]:
for term in durations:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,52]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,40000],
                      outpatient_range=[0,20000],
                      inpatient_range=[0,800000],
                      er_range=[0,10000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusion:

  1. It doesn't look like spending is correlated with time elapsed without health insurance.
In [16]:
for term in age_diagnosed:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,85]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,60000],
                      outpatient_range=[0,30000],
                      inpatient_range=[0,100000],
                      er_range=[0,40000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusion:

  1. There doesn't seem to be a strong correlation between the age of diagnosis for any of these chronic conditions and annual spending on any category of healthcare.
In [17]:
for term in demog:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[x_min-1,x_max+1]
    d={term:[list(interpretation[term].keys()),list(interpretation[term].values())] for term in interpretation}
    d["categorical"]=[[-1,1,2],["inapplicable","yes","no"]]
    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      xticks=d[term] if term in interpretation else d["categorical"],
                      office_range=[0,10000],
                      outpatient_range=[0,10000],
                      inpatient_range=[0,10000],
                      er_range=[0,20000],
                      colorbar=True,
                     )
    print("\n\n\n")

Conclusions:

  1. No features except insurance status generate significant differences in outpatient expenditure.
  2. ER spending is affected by race, marital status, employment status, self-employment status, holding multiple jobs, union status, whether you are offered health insurance by your employer, your overall insurance status (notably whether you have private insurance or a public option), whether your household uses an FSA, where you live (notably, there is more zero clustering for ER spending in the South, and less in the Northeast), and educational attainment.
  3. Both office and inpatient spending are affected by race, marital status, employment status, having more than one job, being self employed, union status, whether health insurance is offered through employment, whether public or private options for health insurance are used, whether FSA's are used, insurance status, census region, and educational attainment.

Finally, a look at whether body mass index affects spending:

In [18]:
for term in bmi:
    x_min=data[term].min()
    x_max=data[term].max()
    x_range=[0,50]

    four_correlations(data[term],
                      x_range=x_range,
                      w=data["FINAL_PERSON_WEIGHT_2013"],
                      #xticks=[[-1,1,2],["inapplicable","yes","no"]],
                      office_range=[0,100000],
                      outpatient_range=[0,40000],
                      inpatient_range=[0,100000],
                      er_range=[0,60000],
                      colorbar=True,
                     )
    print("\n\n\n")

From looking at this, there doesn't seem to be a strong correlation between spending and BMI in any category. I find this somewhat counterintuitive, since obesity is supposed to be correlated with poor cardiovascular health. I'm inclined to leave this feature in to see if there are any interesting interaction effects that may turn up in the modeling process.
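Before deciding BMI's fate, it would be easy to put a number on "no strong correlation" with a weighted Pearson coefficient, built from the same weighted moments used above. A sketch - the function name is my own, and the data is a toy sanity check rather than the MEPS file:

```python
import numpy as np

# Weighted Pearson correlation from weighted means and covariances,
# to quantify what the hexbin plots only show visually.
def weighted_corr(x, y, w):
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    vx = np.average((x - mx) ** 2, weights=w)
    vy = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(vx * vy)

# Sanity check: a perfectly linear relationship gives 1.0
# regardless of the weights.
x = np.array([1.0, 2.0, 3.0])
print(weighted_corr(x, 2 * x + 1, np.array([1.0, 5.0, 2.0])))  # 1.0
```

Running this for BMI against each of the four expenditure columns (with FINAL_PERSON_WEIGHT_2013 as w) would confirm or refute the visual impression before the modeling stage.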

At this point we have a better view of the data, and a sense of what features we expect to be predictive. This looks like a good foundation for a pretty rich modeling exercise.

In the next post, I'll write about the model selection process.

Dec 24, 2015

Looking at the dataset - univariate plots

In this post, I'm going to look at the explanatory features in the dataset.

First, boilerplate:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.pylab.rcParams['figure.figsize'] = (10, 6)

Let's get our dataset, variables and functions from before.

In [2]:
data=pd.read_pickle("/home/soumya/research/insight/insight_project/modeling/data/dsm_data_scoped_variables.pkl")
In [3]:
variables=[]
with open("/home/soumya/research/insight/insight_project/modeling/code/dsm_exog_.txt","r") as f:
    lines=f.readlines()
    for line in lines:
        variables.append(line.strip().upper())
for variable in variables:
    print(variable)
AGE_AS_OF_12/31/13_(EDITED/IMPUTED)
SEX
RACE/ETHNICITY_(EDITED/IMPUTED)
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)
FAMILY'S_TOTAL_INCOME
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3
HIGH_BLOOD_PRESSURE_DIAG_(>17)
CORONARY_HRT_DISEASE_DIAG_(>17)
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE
MULT_DIAG_HIGH_BLOOD_PRESS_(>17)
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE
ANGINA_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-ANGINA
HEART_ATTACK_(MI)_DIAG_(>17)
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)
OTHER_HEART_DISEASE_DIAG_(>17)
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE
STROKE_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-STROKE
EMPHYSEMA_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-EMPHYSEMA
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3
HIGH_CHOLESTEROL_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL
CANCER_DIAGNOSIS_(>17)
CANCER_DIAGNOSED_-_BLADDER_(>17)
CANCER_DIAGNOSED_-_BREAST_(>17)
CANCER_DIAGNOSED_-_CERVIX_(>17)
CANCER_DIAGNOSED_-_COLON_(>17)
CANCER_DIAGNOSED_-_LUNG_(>17)
CANCER_DIAGNOSED_-_LYMPHOMA_(>17)
CANCER_DIAGNOSED_-_MELANOMA_(>17)
CANCER_DIAGNOSED_-_OTHER_(>17)
CANCER_DIAGNOSED_-_PROSTATE_(>17)
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17)
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17)
CANCER_DIAGNOSED_-_UTERUS_(>17)
DIABETES_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-DIABETES
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3
ARTHRITIS_DIAGNOSIS_(>17)
AGE_OF_DIAGNOSIS-ARTHRITIS
ASTHMA_DIAGNOSIS
AGE_OF_DIAGNOSIS-ASTHMA
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3
EVER_USED_PEAK_FLOW_METER_-_RD_5/3
ADHDADD_DIAGNOSIS_(5-17)
AGE_OF_DIAGNOSIS-ADHD/ADD
PREGNANT_DURING_REF_PERIOD_-_RD_3/1
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1
EMPLOYMENT_STATUS_RD_3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE
SELF-EMPLOYED_AT_RD_3/1_CMJ
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ
INDUSTRY_GROUP_RD_3/1_CMJ
UNION_STATUS_AT_RD_3/1_CMJ
OCCUPATION_GROUP_RD_3/1_CMJ
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED)
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED)
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013
ANYONE_IN_RU_HAVE_FSA_-_R3/1
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3
PUBLIC_INS_ANY_TIME_IN_R5/R3
INSURED_ANY_TIME_IN_R3/1
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1
#_OFFICE-BASED_PROVIDER_VISITS_13
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13
#_EMERGENCY_ROOM_VISITS_13
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013
FINAL_PERSON_WEIGHT_2013
TOTAL_OFFICE-BASED_EXP_13
TOTAL_OUTPATIENT_PROVIDER_EXP_13
TOT_HOSP_IP_FACILITY_+_DR_EXP_13
TOTAL_ER_FACILITY_+_DR_EXP_13
CENSUS_REGION_AS_OF_12/31/13
EDUCATION_RECODE_(EDITED)
In [4]:
interpretation={}

interpretation["categorical"]={
                1: "Yes",
                2: "No",
                -1: "Inapplicable",
                -2: "Determined in previous round",
                -7: "Refused",
                -8: "Don't know",
                -9: "Not ascertained",
                -10: "HOURLY WAGE >= $76.96",
                -13: "INITIAL WAGE IMPUTED"
                }

interpretation["RACE/ETHNICITY_(EDITED/IMPUTED)"]={1: "HISPANIC",
                        2: "NON-HISPANIC WHITE ONLY",
                        3: "NON-HISPANIC BLACK ONLY",
                        4: "NON-HISPANIC ASIAN ONLY",
                        5: "NON-HISPANIC OTHER RACE OR MULTIPLE RACE"
                        }

interpretation["SEX"]={1: "MALE", 2: "FEMALE"}

interpretation["MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)"]={-9: "NOT ASCERTAINED",
                           -8: "DK",
                           -7: "REFUSED",
                           1: "MARRIED",
                           2: "WIDOWED",
                           3: "DIVORCED",
                           4: "SEPARATED",
                           5: "NEVER MARRIED",
                           6: "UNDER 16 - INAPPLICABLE"
                          }

interpretation["EDUCATION_RECODE_(EDITED)"]={-9: "NOT ASCERTAINED",
                             -8: "DK",
                             -7: "REFUSED",
                             -1: "INAPPLICABLE OR UNDER 5",
                             1: "LESS THAN/EQUAL TO 8TH GRADE",
                             2: "9 - 12TH GRADE, NO HS DIPLOMA OR GED",
                             13: "GED OR HS GRAD",
                             14: "BEYOND HS,COLLEGE(NO 4YR DEG),ASSOC DEG",
                             15: "4-YEAR COLLEGE DEGREE, BACHELOR'S DEGREE",
                             16: "MASTER'S, DOCTORATE, OR PROFESSIONAL DEG"
                             }

interpretation["INDUSTRY_GROUP_RD_3/1_CMJ"]={-9: "NOT ASCERTAINED",
                            -1: "INAPPLICABLE",
                            1: "NATURAL RESOURCES",
                            2: "MINING",
                            3: "CONSTRUCTION",
                            4: "MANUFACTURING",
                            5: "WHOLESALE AND RETAIL TRADE",
                            6: "TRANSPORTATION AND UTILITIES",
                            7: "INFORMATION",
                            8: "FINANCIAL ACTIVITIES",
                            9: "PROFESSIONAL AND BUSINESS SERVICES",
                            10: "EDUCATION, HEALTH, AND SOCIAL SERVICES",
                            11: "LEISURE AND HOSPITALITY",
                            12: "OTHER SERVICES",
                            13: "PUBLIC ADMINISTRATION",
                            14: "MILITARY",
                            15: "UNCLASSIFIABLE INDUSTRY"}

interpretation["OCCUPATION_GROUP_RD_3/1_CMJ"]={
    -9: "NOT ASCERTAINED",
    -1: "INAPPLICABLE",
    1: "MANAGEMENT, BUSINESS, AND FINANCIAL OPER",
    2: "PROFESSIONAL AND RELATED OCCUPATIONS", 
    3: "SERVICE OCCUPATIONS",
    4: "SALES AND RELATED OCCUPATIONS", 
    5: "OFFICE AND ADMINISTRATIVE SUPPORT",
    6: "FARMING, FISHING, AND FORESTRY",
    7: "CONSTRUCTION, EXTRACTION, AND MAINTENANC",
    8: "PRODUCTION, TRNSPORTATION, MATRL MOVING", 
    9: "MILITARY SPECIFIC OCCUPATIONS",
    11: "UNCLASSIFIABLE OCCUPATION"}

interpretation["CENSUS_REGION_AS_OF_12/31/13"]={-1: "Inapplicable",
                                                1:"Northeast",
                                                2:"Midwest",
                                                3:"South",
                                                4:"West"}

interpretation["EMPLOYMENT_STATUS_RD_3/1"]={-9: "NOT ASCERTAINED",
                                            -8: "DK",
                                            -7: "REFUSED",
                                            -1: "INAPPLICABLE",
                                            1: "EMPLOYED AT RD 3/1 INT DATE",
                                            2: "JOB TO RETURN TO AT RD 3/1 INT DATE",
                                            3: "JOB DURING RD 3/1 REF PERIOD",
                                            4: "NOT EMPLOYED DURING RD 3/1"}

interpretation["FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013"]={
     1: "<65 ANY PRIVATE",
     2: "<65 PUBLIC ONLY",
     3: "<65 UNINSURED",
     4: "65+ EDITED MEDICARE ONLY",
     5: "65+ EDITED MEDICARE AND PRIVATE",
     6: "65+ EDITED MEDICARE AND OTH PUB ONLY",
     7: "65+ UNINSURED",
     8: "65+ NO MEDICARE AND ANY PUBLIC/PRIVATE"
    }
In [5]:
def interpret_vectorized(data,feature):
    if feature in interpretation:
        return data[feature].map(lambda x: interpretation[feature][x])
    elif feature in categorical:
        return data[feature].map(lambda x: interpretation["categorical"][x])
    else:
        return data[feature].map(lambda x: "inapplicable" if x<0 else x)

Next I'm going to prototype a function that will spit out the empirical distributions of each of the explanatory variables.

I'm going to have to treat categorical variables differently from continuous variables.

One rough and ready way to break out the categorical variables is to check how many levels a feature has in the dataset - if there are relatively few (say, below a threshold of 10 or 20), I'll assume the feature is categorical; otherwise I'll assume it's continuous. To deal with some of the edge cases, anything involving an age or a # of visits should be continuous.

Let's see if this works.

In [6]:
def check_discrete(data,thevariable):
    if len(data[thevariable].unique())>19 or "AGE_OF" in thevariable or "AGE_AS" in thevariable or "#" in thevariable:
        return "continuous"
    else:
        return "categorical"

Let's see if this gave us what we wanted.

In [7]:
print("Continuous features\n\n")
continuous=[term for term in variables if check_discrete(data,term)=="continuous"]
for term in continuous:
    print(term)
print("\n\n\n")
print("Categorical features\n\n")
categorical=[term for term in variables if check_discrete(data,term)=="categorical"]
for term in categorical:
    print(term)
Continuous features


AGE_AS_OF_12/31/13_(EDITED/IMPUTED)
FAMILY'S_TOTAL_INCOME
ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3
AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE
AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE
AGE_OF_DIAGNOSIS-ANGINA
AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)
AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE
AGE_OF_DIAGNOSIS-STROKE
AGE_OF_DIAGNOSIS-EMPHYSEMA
AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL
AGE_OF_DIAGNOSIS-DIABETES
AGE_OF_DIAGNOSIS-ARTHRITIS
AGE_OF_DIAGNOSIS-ASTHMA
AGE_OF_DIAGNOSIS-ADHD/ADD
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL
#_OFFICE-BASED_PROVIDER_VISITS_13
#_OUTPATIENT_DEPT_PROVIDER_VISITS_13
#_EMERGENCY_ROOM_VISITS_13
#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013
FINAL_PERSON_WEIGHT_2013
TOTAL_OFFICE-BASED_EXP_13
TOTAL_OUTPATIENT_PROVIDER_EXP_13
TOT_HOSP_IP_FACILITY_+_DR_EXP_13
TOTAL_ER_FACILITY_+_DR_EXP_13




Categorical features


SEX
RACE/ETHNICITY_(EDITED/IMPUTED)
MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)
HIGH_BLOOD_PRESSURE_DIAG_(>17)
CORONARY_HRT_DISEASE_DIAG_(>17)
MULT_DIAG_HIGH_BLOOD_PRESS_(>17)
ANGINA_DIAGNOSIS_(>17)
HEART_ATTACK_(MI)_DIAG_(>17)
OTHER_HEART_DISEASE_DIAG_(>17)
STROKE_DIAGNOSIS_(>17)
EMPHYSEMA_DIAGNOSIS_(>17)
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R3/1
CHRONC_BRONCHITS_LAST_12_MTHS_(>17)-R5/3
HIGH_CHOLESTEROL_DIAGNOSIS_(>17)
CANCER_DIAGNOSIS_(>17)
CANCER_DIAGNOSED_-_BLADDER_(>17)
CANCER_DIAGNOSED_-_BREAST_(>17)
CANCER_DIAGNOSED_-_CERVIX_(>17)
CANCER_DIAGNOSED_-_COLON_(>17)
CANCER_DIAGNOSED_-_LUNG_(>17)
CANCER_DIAGNOSED_-_LYMPHOMA_(>17)
CANCER_DIAGNOSED_-_MELANOMA_(>17)
CANCER_DIAGNOSED_-_OTHER_(>17)
CANCER_DIAGNOSED_-_PROSTATE_(>17)
CANCER_DIAGNOSED_-_SKIN-NONMELANO_(>17)
CANCER_DIAGNOSED-SKIN-UNKNOWN_TYPE_(>17)
CANCER_DIAGNOSED_-_UTERUS_(>17)
DIABETES_DIAGNOSIS_(>17)
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_3/1
JOINT_PAIN_LAST_12_MONTHS_(>17)_-_RD_5/3
ARTHRITIS_DIAGNOSIS_(>17)
ASTHMA_DIAGNOSIS
DOES_PERSON_STILL_HAVE_ASTHMA-RD3/1
DOES_PERSON_STILL_HAVE_ASTHMA_-_RD_5/3
ASTHMA_ATTACK_LAST_12_MOS_-_RD3/1
USED_ACUTE_PRES_INHALER_LAST_3_MOS-RD5/3
USED>3ACUTE_CN_PRES_INH_LAST_3_MOS-RD5/3
EVER_USED_PREV_DAILY_ASTHMA_MEDS_-RD_5/3
NOW_TAKE_PREV_DAILY_ASTHMA_MEDS_-_RD_5/3
HAVE_PEAK_FLOW_METER_AT_HOME_-_RD_5/3
EVER_USED_PEAK_FLOW_METER_-_RD_5/3
ADHDADD_DIAGNOSIS_(5-17)
PREGNANT_DURING_REF_PERIOD_-_RD_3/1
LIMITATION_IN_PHYSICAL_FUNCTIONING-RD3/1
EMPLOYMENT_STATUS_RD_3/1
HAS_MORE_THAN_ONE_JOB_RD_3/1_INT_DATE
SELF-EMPLOYED_AT_RD_3/1_CMJ
CHOICE_OF_HEALTH_PLANS_AT_RD_3/1_CMJ
INDUSTRY_GROUP_RD_3/1_CMJ
UNION_STATUS_AT_RD_3/1_CMJ
OCCUPATION_GROUP_RD_3/1_CMJ
HEALTH_INSUR_HELD_FROM_RD_3/1_CMJ_(ED)
HEALTH_INSUR_OFFERED_BY_RD_3/1_CMJ_(ED)
EMPLOYER_OFFERS_HEALTH_INS_RD_3/1_CMJ
FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013
ANYONE_IN_RU_HAVE_FSA_-_R3/1
PRIVATE_INSURANCE_ANY_TIME_IN_R5/R3
PUBLIC_INS_ANY_TIME_IN_R5/R3
INSURED_ANY_TIME_IN_R3/1
ANY_TIME_COVERAGE_BY_STATE_INS_-_R3/1
CENSUS_REGION_AS_OF_12/31/13
EDUCATION_RECODE_(EDITED)

Pretty good!

Next I'll prototype a function to produce bar plots for the categorical variables. Since the data points are weighted, I need to make sure each point is scaled by its survey weight before aggregating under each label.

Here's the function:

In [8]:
def explore(data,thevar):
    # translate coded levels into readable labels
    temp=interpret_vectorized(data,thevar)
    w=data["FINAL_PERSON_WEIGHT_2013"]
    temp=pd.concat([temp,w],axis=1)
    # sum the survey weights under each label to get population counts
    result=temp.groupby([thevar])["FINAL_PERSON_WEIGHT_2013"].sum()
    ax=result.plot(kind="bar",title=thevar)
    ax.set_ylabel("Count")
    plt.show()

The categorical features above split up nicely into two groups that are interesting. The first consists of health status features, and the second consists of demographic and socioeconomic features. I'll group them accordingly:

In [9]:
health=[term for term in categorical if "DIAG" in term or 
        "LIMITATION" in term or
        "ASTHMA" in term or 
        "JOINT" in term or 
        "PEAK_FLOW" in term or
       "PREGNANT" in term or
       "ACUTE" in term or
       "BRONCH" in term]

demog=[term for term in categorical if term not in health]
In [10]:
for term in health:
    explore(data,term)

Here are a few observations about these health status features:

  1. High blood pressure, diabetes, high cholesterol, joint pain and arthritis are relatively common in the population.
  2. Heart disease (angina, heart attack, etc) and stroke are less prevalent, but still substantial.
  3. Ditto for lung disease (emphysema, bronchitis, asthma).
  4. While a cancer diagnosis is relatively common, the individual cancer diagnoses have very low mass in the "yes" bucket. This will likely pose a problem for estimating models, unless we either upsample, or otherwise penalize the other responses. Including these features in our model may be informative if different cancers affect expenditure in different ways, but we'll have to balance this against the lack of reliability of parameter estimates that we'll get from these features. Unless we have a good reason to believe that different cancers cost very different amounts to treat, we may be best off using the aggregate cancer diagnosis feature, and dropping the individual diagnosis features.
  5. Practically no-one uses a peak flow meter.
  6. There's relatively low incidence of ADD/ADHD in the population.
  7. Relatively few people are pregnant at the time of the survey.
  8. General limitation in physical functioning is fairly prevalent.
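To make point 4 concrete, here's a minimal sketch (on a toy frame, using the dataset's coding convention of 1 = yes, 2 = no, -1 = inapplicable) of how one might measure the weighted "yes" share for a rare diagnosis feature; the helper name weighted_yes_share is my own:

```python
import pandas as pd

# Toy stand-in for the survey frame, using the dataset's coding:
# 1 = yes, 2 = no, -1 = inapplicable.
toy = pd.DataFrame({
    "CANCER_DIAGNOSED_-_LUNG_(>17)": [2, 2, 2, 1, -1, 2],
    "FINAL_PERSON_WEIGHT_2013":      [1.0, 2.0, 1.5, 0.5, 1.0, 2.0],
})

def weighted_yes_share(data, feature, weight="FINAL_PERSON_WEIGHT_2013"):
    """Weighted share of 'yes' (level==1) among applicable respondents."""
    applicable = data[data[feature] > 0]
    yes_mass = applicable.loc[applicable[feature] == 1, weight].sum()
    return yes_mass / applicable[weight].sum()

share = weighted_yes_share(toy, "CANCER_DIAGNOSED_-_LUNG_(>17)")
# a very small share flags a feature whose "yes" level may be too
# thin to support reliable parameter estimates
```

Running a check like this over the individual cancer features would quantify just how thin the "yes" buckets are before deciding whether to keep them.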

Next let's look at the demographic and socio-economic features.

In [11]:
for term in demog:
    explore(data,term)

About the demographic/socio-economic features:

  1. The male/female subpopulations are roughly balanced, with slightly more women than men.
  2. The race/ethnicity distribution is consistent with other surveys.
  3. The distribution estimates that about 70m people are under 16, a similar number have never been married, and about 130m people are married.
  4. About 150m people were employed when the survey was conducted. About 100m people are not employed (not in the labor force, or unemployed). This is consistent with reported data elsewhere. About 20m people are self employed, and about 10m people hold more than one job.
  5. Remarkably few people - less than 50m - have a choice of health plans at work. I'm interested in whether people who have a choice of health plans tend to pay less for healthcare on average, assuming there is any effect coming from encouraging people to think about their purchasing choices on subsequent utilization or on prices paid/negotiated.
  6. The modal entry for industry group and occupational group (by far) is inapplicable. This is a constructed feature in the dataset, so I assume this wasn't directly asked of respondents. It's hard to say what this might mean, or if it'll be useful to keep in our model.
  7. About 10m people are in a union. It would also be interesting to know whether being part of a union affects utilization or expenditure in any way.
  8. About 60m people did not hold health insurance at the time. These numbers are consistent with pre-ACA uninsured levels.
  9. About 80m people were offered health insurance through their employers, and 40m were not.
  10. About 180m people under 65 had private insurance coverage during the survey, about 50m people under 65 had only public insurance options, and about 50m people under 65 were uninsured. There were about 20m people over 65 holding a mix of private insurance and medicare, and a similar number with only medicare coverage.
  11. About 30m people lived in a household where someone had a flexible spending account. This is about $10\%$ of the population, but FSAs (especially coupled with high-deductible plans) are an interesting demand-side innovation and it would be interesting to understand if this plays a role in determining expenditure.
  12. A negligible number of people (relatively speaking) had coverage under a state insurance plan.
  13. The distribution of people over census regions is approximately as follows: about 120m in the south, about 75m people in the west, about 65m people in the midwest, and about 60m people in the northeast.
  14. The distribution over educational attainment is as follows: about 70m people have education beyond high school short of a 4-year degree (including associate's degrees), about 65m have a GED or high school diploma, a little under 60m have an 8th grade education or less, about 30m people left high school between 9th and 12th grade without a diploma, a little over 40m people have a 4-year degree, about 25m people have a master's, doctoral or professional degree, and about 25m people fall into the inapplicable bucket.

Now that we've had a look at the categorical features, let's take a look at the continuous features.

In [12]:
continuous
Out[12]:
['AGE_AS_OF_12/31/13_(EDITED/IMPUTED)',
 "FAMILY'S_TOTAL_INCOME",
 'ADULT_BODY_MASS_INDEX_(>17)_-_RD_5/3',
 'AGE_OF_DIAGNOSIS-HIGH_BLOOD_PRESSURE',
 'AGE_OF_DIAGNOSIS-CORONARY_HEART_DISEASE',
 'AGE_OF_DIAGNOSIS-ANGINA',
 'AGE_OF_DIAGNOSIS-HEART_ATTACK(MI)',
 'AGE_OF_DIAGNOSIS-OTHER_HEART_DISEASE',
 'AGE_OF_DIAGNOSIS-STROKE',
 'AGE_OF_DIAGNOSIS-EMPHYSEMA',
 'AGE_OF_DIAGNOSIS-HIGH_CHOLESTEROL',
 'AGE_OF_DIAGNOSIS-DIABETES',
 'AGE_OF_DIAGNOSIS-ARTHRITIS',
 'AGE_OF_DIAGNOSIS-ASTHMA',
 'AGE_OF_DIAGNOSIS-ADHD/ADD',
 '#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL',
 '#_OFFICE-BASED_PROVIDER_VISITS_13',
 '#_OUTPATIENT_DEPT_PROVIDER_VISITS_13',
 '#_EMERGENCY_ROOM_VISITS_13',
 '#_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013',
 'FINAL_PERSON_WEIGHT_2013',
 'TOTAL_OFFICE-BASED_EXP_13',
 'TOTAL_OUTPATIENT_PROVIDER_EXP_13',
 'TOT_HOSP_IP_FACILITY_+_DR_EXP_13',
 'TOTAL_ER_FACILITY_+_DR_EXP_13']

There are, broadly speaking, continuous variables in the following categories: ages, income levels, time since some event, number of occurrences of an event, and total expenditures.

Let's group these variables:

In [13]:
age=[term for term in continuous if "AGE_AS" in term]
age_diagnosed=[term for term in continuous if "AGE_OF" in term]
bmi=[term for term in continuous if "ADULT" in term and "BODY" in term and "MASS" in term]
incomes=[term for term in continuous if "FAMILY" in term and "INCOME" in term]
durations=['#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL']
utilizations=[term for term in continuous if "#" in term and "WKS" not in term]
expenditures=['TOTAL_OFFICE-BASED_EXP_13',
              'TOTAL_OUTPATIENT_PROVIDER_EXP_13',
              'TOT_HOSP_IP_FACILITY_+_DR_EXP_13',
              'TOTAL_ER_FACILITY_+_DR_EXP_13'
             ]
    
    

Let's look at the ages first.

The documentation only mentions that the value $-1$ is recorded for age when there is no valid data collected and no imputation can reasonably be made. In the following graph I'll censor this level, since there's not really a good way to interpret this that I can think of. I will probably also leave it out when building the model.

In [14]:
for term in age:
    bins=sorted(data[term][data[term]>=0].unique())
    ax=data[term][data[term]>=0].plot(kind="hist",bins=bins,title=term,xticks=range(min(bins),max(bins),5), 
                                      weights=data["FINAL_PERSON_WEIGHT_2013"][data[term]>=0])
    ax.set_ylabel("Count")
    plt.show()

The first thing to notice is the huge mass at 85. There aren't actually over 6m people in the country of age exactly 85; the data is top-coded at this value to preserve respondent privacy. This means I won't be able to build a model that discriminates on age past 85, and I'll just have to lump everyone older than that into the same bucket. I'll hopefully be able to get a reasonable cross section of that subpopulation using other features.
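One way to handle the top-code, sketched here on toy values, is to censor the invalid ages and relabel 85 as an explicit "85+" bucket; this is an illustration, not necessarily the exact transform I'll use in the model:

```python
import pandas as pd

# Toy ages; MEPS top-codes age at 85, so a recorded 85 means "85 or older",
# and -1 means no valid data could be collected or imputed.
ages = pd.Series([3, 47, 85, 62, 85, -1])

valid = ages[ages >= 0]                                  # censor the -1 level
bucketed = valid.map(lambda a: "85+" if a >= 85 else a)  # explicit top bucket
```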

Next let's look at the age of diagnosis features. Since most people don't have chronic illnesses of these types, the mass at -1 dominates the distribution. Accordingly, I'm going to censor those points in these plots, so each of these distributions are conditional on having that chronic condition.

Here's a modified function from the last post for producing histograms with weighted data.

In [15]:
def zoom_hist(thevar, therange,bin_width=100, xticks=None):
    '''wrapper around df.hist'''
    temp=data[[thevar]+["FINAL_PERSON_WEIGHT_2013"]]
    bins=range(therange[0],therange[1],bin_width)
    ax=temp[thevar].hist(bins=bins,range=therange, weights=temp["FINAL_PERSON_WEIGHT_2013"])
    ax.set_title("%s between %s and %s, bin width is %s"%(thevar,therange[0],therange[1],bin_width))
    ax.set_xlabel(thevar)
    ax.set_ylabel("Count")
    if xticks is not None:
        ax.set_xticks(xticks)
        ax.set_xticklabels(xticks, rotation="vertical")
    plt.show()
In [16]:
for term in age_diagnosed:
    zoom_hist(term,therange=[1,85],bin_width=1)

Here are some observations about the age-of-diagnosis distributions:

  1. The "lifestyle" diseases - heart attack, angina, high blood pressure, coronary heart disease, stroke, emphysema, high cholesterol, and diabetes - all follow very similar distributions, which are all not-quite normal, with modes in the 50-60 range, and with some right skew.
  2. It will be interesting to look at correlations between these lifestyle disease features. I don't know a priori if this should be high or low, but I would imagine that having multiple comorbidities would be predictive of higher expenditure on healthcare.
  3. Arthritis follows a similar distribution to the lifestyle diseases.
  4. Asthma is most diagnosed at very young ages, with the diagnosis rate trailing off as the population gets older - a heavy right skew.
  5. The age of diagnosis for ADD/ADHD is very tightly clustered in the 0-20 range, with a bell shaped distribution, non-existent tails, and some skew to the left.
  6. A generally interesting hypothesis about the relationship between expenditure and the age of diagnosis of any of these diseases may be that the earlier the diagnosis, the better the patient is at making lifestyle changes that can mitigate expenditure levels. It might be interesting to see if this holds up in the modeling.

Let's look at BMI.

In [17]:
for term in bmi:
    zoom_hist(term,therange=[0,100],bin_width=1)

The BMI (body mass index) is computed by dividing a person's weight in pounds by the square of their height in inches, then multiplying by 703.

Here's how to interpret the adult body mass index, according to the CDC guidelines.

The distribution of BMI in the population is estimated by this histogram to be bell-shaped, with a heavier right tail, and centered around 25 or so.
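As a quick sanity check on the formula, here's a one-line implementation (the function name and example values are mine):

```python
def bmi(weight_lb, height_in):
    """CDC formula: 703 * weight (lb) / height (in) squared."""
    return 703 * weight_lb / height_in ** 2

# e.g. 150 lb at 65 inches: just under the overweight cutoff of 25
value = bmi(150, 65)
```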

Next, on to the income variable:

In [18]:
for term in incomes:
    print(data[term].describe())
count     36940.000000
mean      58199.456984
std       56276.325042
min     -258220.000000
25%       20000.000000
50%       41207.000000
75%       79400.000000
max      543051.000000
Name: FAMILY'S_TOTAL_INCOME, dtype: float64

What's really weird about this is that I see responses below 0; far below 0!

Let's take a closer look.

In [19]:
data[incomes+expenditures][data[incomes[0]]<0]
Out[19]:
FAMILY'S_TOTAL_INCOME TOTAL_OFFICE-BASED_EXP_13 TOTAL_OUTPATIENT_PROVIDER_EXP_13 TOT_HOSP_IP_FACILITY_+_DR_EXP_13 TOTAL_ER_FACILITY_+_DR_EXP_13
54 -258220 1467 9 0 534
55 -258220 0 0 0 0
4843 -3005 0 0 0 0
4844 -3005 37 0 0 622
8200 -1500 2815 0 0 1146
27520 -2196 0 0 0 0

Thankfully there are not many data points here. An important question is how to interpret an income of $-258220$. The smaller values for negative incomes are also interesting - are these reported after something like adjusting for public assistance by the respondents?

Let's take a look at the survey documentation.

FAMINC13 contains total family income for each person’s CPS family. Family income was derived by constructing person-level total income comprising annual earnings from wages, salaries, bonuses, tips, commissions; business and farm gains and losses; unemployment and workers’ compensation; interest and dividends; alimony, child support, and other private cash transfers; private pensions, IRA withdrawals, social security, and veterans payments; supplemental security income and cash welfare payments from public assistance, Temporary Assistance for Needy Families, and related programs; gains or losses from estates, trusts, partnerships, S corporations, rent, and royalties; and a small amount of “other” income. Person-level income excluded tax refunds and capital gains. Person-level income totals were then summed over family members, as defined by CPSFAMID, to yield CPS family-level total income (FAMINC13). 

Looking at the expenditure features, there's nothing particularly outlandish about these records apart from the incomes. Given this, I think I'm inclined to go by: "when in doubt, throw it out" when it comes to these data points.
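Throwing those rows out is a one-liner; here's a sketch on a toy frame containing the suspicious values from above:

```python
import pandas as pd

# Toy frame with the negative incomes seen in the query above.
toy = pd.DataFrame({"FAMILY'S_TOTAL_INCOME": [-258220, 41207, -3005, 79400]})

# "When in doubt, throw it out": keep only non-negative incomes.
cleaned = toy[toy["FAMILY'S_TOTAL_INCOME"] >= 0]
```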

Let's truncate at 0 and take a look at the shape of the distribution.

In [20]:
for term in incomes:
    zoom_hist(term,therange=[0,1000000],bin_width=10000,xticks=range(0,1000000,50000))

The distribution is unimodal (mode around $\$50,000$), with most of the mass concentrated between $\$0$ and $\$200,000$, and a thin (but very well off!) tail past that point.

Let's move on to the durations variables.

In [21]:
data[durations].describe()
Out[21]:
#_WKS/MON_WOUT_HLTH_INS_PRV_YR-PN_18_ONL
count 36940.000000
mean -0.899729
std 1.303947
min -9.000000
25% -1.000000
50% -1.000000
75% -1.000000
max 52.000000
In [22]:
data[durations[0]].value_counts()
Out[22]:
-1     36508
 12       81
 6        49
 4        42
 3        37
 2        36
 10       24
 5        24
 8        24
 1        21
 7        20
 9        14
 11       14
-8        11
 0        11
 52        7
 18        5
 40        4
-9         2
 26        2
 30        1
 43        1
 36        1
-7         1
dtype: int64

Since the majority of respondents did carry insurance in the previous year, I'll drop the points at -1 to get a sense of the distribution conditional on not being insured.

In [23]:
for term in durations:
    zoom_hist(term,therange=[0,60],bin_width=1)

This is quite interesting. Here's what the documentation has to say about this feature:

For persons who were covered by health insurance on January 1st, it was ascertained if they were ever without health insurance in the previous year (NOINSBEF). The number of weeks/months without health insurance was also ascertained (NOINSTM, NOINUNIT). 

I'm interested to know if this feature might be predictive of expenditure. It's hard to say - going a long time without insurance might be reflective of being unemployed, or if the respondent did have the means to buy insurance, may be a proxy for their (lack of) demand for risk mitigation. If either of these correlations hold, this might say something about the effect of either bargaining position, or risk tolerance, on propensity to spend.

Next, the utilization features:

In [24]:
for term in utilizations:
    zoom_hist(term,therange=[0,500],bin_width=5)

It seems clear that most people use relatively little healthcare. Let's confirm that many of these utilization counts are exactly 0.

In [ ]:
for term in utilizations:
    print(data[term].describe(),"\n")
count    36940.000000
mean         4.454548
std          9.982592
min          0.000000
25%          0.000000
50%          1.000000
75%          5.000000
max        277.000000
Name: #_OFFICE-BASED_PROVIDER_VISITS_13, dtype: float64 

count    36940.000000
mean         0.353952
std          2.373837
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        171.000000
Name: #_OUTPATIENT_DEPT_PROVIDER_VISITS_13, dtype: float64 

count    36940.000000
mean         0.203303
std          0.633602
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         24.000000
Name: #_EMERGENCY_ROOM_VISITS_13, dtype: float64 

count    36940.000000
mean         0.408663
std          3.817236
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        209.000000
Name: #_NIGHTS_IN_HOSP_FOR_DISCHARGES_2013, dtype: float64 

That's definitely the case. Let's see what the distributions look like conditional on positive utilization.

In [ ]:
for term in utilizations:
    zoom_hist(term,therange=[1,60],bin_width=1)

These distributions all look very similar: the modal positive utilization level is one visit, and utilization rapidly decays, although there are very heavy tails for office-based and inpatient services (we've noticed this already in the expenditure features for these categories).

The main difference between these categories is the scale of use. For example, the modal office-visit level covers over 50m people, outpatient and ER about 30m each, and inpatient nights about 5m.

We've already looked at the patterns in expenditure, so this takes us to the end of the continuous variables we want to look at.

The next thing I'd like to do is to take a closer look at the features to check for bivariate relationships. Do specific diagnoses determine higher levels of expenditure? Do people who make more money pay more for healthcare? If there is such a relationship, is it linear? Is the growth faster or slower than linear? Do certain diagnoses, while not individually affecting expenditure much, combine to have a large effect, perhaps larger than the sum of their individual pieces?

Looking more closely at the data will, I hope, help me validate or refute interesting hypotheses like these, and possibly generate many more. The first step is to try and visualize the possible relationships, which will be in the next post.

Click to read and post comments

Dec 22, 2015

Looking at the dataset - yet more cleaning, looking at the predicted variables

In the last post, I managed to construct the MEPS dataset as a csv file. Here's the result, for reference. (Warning: the file is large: about 250MB.)

The next thing to do is to take a look at the dataset, and see if we need to do any more processing. Since we have it written to disk as a csv file, I can import it using pandas' read_csv function. I may have to do a bit of plotting when I explore the dataset, so I'll go ahead and load all the packages I imagine I'll need for this.

Here's the boilerplate:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.pylab.rcParams['figure.figsize'] = (10, 6)

Let's load the dataset as a pandas dataframe and see what it contains:

In [2]:
cd /home/soumya/research/insight/insight_project/data/
/home/soumya/research/insight/insight_project/data
In [ ]:
ls
2013_consolidated_frame.csv     dsm_data_cleaned_columns.pkl
2013_test_with_dummies.pkl      dsm_data.csv
2013_train_with_dummies.pkl     dsm_data.pkl
2013_validate_with_dummies.pkl  dsm_data_scoped_variables.pkl
2013_with_dummies_.pkl          Untitled.ipynb
2013_with_dummies.pkl
In [ ]:
data=pd.read_csv("./2013_consolidated_frame.csv")

First let's take a look at what features we have:

In [ ]:
data.columns

The first thing I get from this is that I need to learn what these features actually are, which means looking up the MEPS survey methodology and documentation. There are 1792 features for each record, so this may be a bit involved.

The second thing I get out of looking at the column names is that they really do need a bit of cleanup. For example, the first feature is called "DWELLING UNIT ID", but the second is called " PERSON NUMBER" (note the extra space at the beginning). This seems to be a running problem in the column names. There's also a column right at the end with no name.

There are 1792 column names to fix, which is entirely too many to fix by hand.

On the other hand, it looks like the cleanup consists of a couple of simple rules:

  1. Remove the initial space in the feature name
  2. Pop the last entry from the column names

Let's write a function to do the first job:

In [ ]:
def clean_columns(theframe):
    # strip leading/trailing whitespace from every column name
    theframe.columns=theframe.columns.map(lambda x: str(x).strip())
    return theframe

Let's try this on our dataframe and see if it does the first task.

In [ ]:
data1=clean_columns(data)
data1.head()

So far so good; now I just need to get rid of the last column.

In [ ]:
data1=data1.drop("",axis=1)
data1.head()

Great, that worked. Alright, let's roll that into the clean_columns function:

In [ ]:
def clean_columns(theframe):
    # strip whitespace from the column names, then drop the unnamed last column
    theframe.columns=theframe.columns.map(lambda x: str(x).strip())
    theframe=theframe.drop("",axis=1)
    return theframe
In [ ]:
data=clean_columns(data)
In [ ]:
data.columns

Perfect.

One thing worth remarking on is that I've actually extracted the variable description rather than the variable name in this step. This was a deliberate decision, as I find it much more intuitive to work with variable names like "MOST RECENT COLONOSCOPY (>39) - R5/3" rather than the variable name "CLNTST53". Those of you familiar with the constraints of variable names in SQL databases will object that this is going to cause trouble in the future. I decided to go with the more descriptive variable names, and write a coding-decoding function to help move data in and out of SQL databases a bit later in the project. Overall, I found this to be a reasonable choice.
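Such a coding-decoding helper might look like the following sketch - the col_NNNN scheme and the function name make_codec are hypothetical, just to illustrate the round trip between descriptive names and SQL-safe identifiers:

```python
# Hypothetical coder-decoder between descriptive column names and
# SQL-safe identifiers; the col_NNNN scheme is an illustration only.
def make_codec(columns):
    encode = {name: "col_%04d" % i for i, name in enumerate(columns)}
    decode = {sql_name: name for name, sql_name in encode.items()}
    return encode, decode

cols = ["MOST_RECENT_COLONOSCOPY_(>39)_-_R5/3", "FAMILY'S_TOTAL_INCOME"]
enc, dec = make_codec(cols)

# round trip: descriptive -> SQL-safe -> descriptive
assert all(dec[enc[c]] == c for c in cols)
```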

To this end, let me go ahead and get rid of the spaces in the variables names before that causes any bugs down the road.

In [ ]:
def underscore_the_headers(data):
    # replace spaces with underscores in every column name
    data.columns=[term.replace(' ','_') for term in data.columns]
    return data

Let's run the data columns through this function:

In [ ]:
data=underscore_the_headers(data)
In [ ]:
data.columns

Great!

Next, I'd like to learn a bit more about what these features are. The main thing to keep in mind is that having a large number of features will make our model very unwieldy.

The crucial thing for this model is for it to be as useful as possible, and if I have to collect 1700 features or so about a person before I can give them any useful feedback, this model won't be used by anyone.

Secondly, for the purposes of my project, I'm most concerned with how the demographic, economic, or health status of a consumer drives cost, since these are the things that can inform how we choose to design the market for healthcare services.

To this end, let's take a look at what's available at the MEPS page that points to the dataset. This page gives us two helpful pieces of information: the first is the documentation for the dataset and the second is the codebook, which we've already used in constructing the csv file.

While the codebook contains the entire listing of variables in the dataset and the start-end points in the ASCII file, it does not tell us what values the variables actually take. To be fair, the variable names are fairly descriptive so we might be able to make educated guesses here, but a quick query of one of the columns in the dataset we've constructed:

In [ ]:
data["HIGH_CHOLESTEROL_DIAGNOSIS_(>17)"].describe()

raises the question: what do the values of this variable mean? It has a max of 2, and a min of -9. That doesn't correspond to any prior I currently have over the values of a diagnosis of high cholesterol. I'd expect to see things like positive, negative, not tested, and so on. Let's take a more detailed look:

In [ ]:
data["HIGH_CHOLESTEROL_DIAGNOSIS_(>17)"].value_counts()

The most frequently populated values are 2, 1 and -1, but there are also records with values -8, -7 and -9.

This warrants a closer look at the documentation. A quick search for "cholesterol" turns up:

High Cholesterol

Questions about high cholesterol were asked of persons aged 18 or older. Consequently, persons aged 17 or younger were coded as “Inapplicable” (-1) on these variables. CHOLDX ascertained whether the person had ever been diagnosed as having high cholesterol. Through 2007, a person-level variable (CHLAGE) indicated the age of diagnosis for high cholesterol on the Person-Level Use PUF. The age of diagnosis for high cholesterol (CHOLAGED) is included in this file. This variable is top-coded to 85 years of age.



which is somewhat helpful, but doesn't explain the use of the values -8 and -9.

Further digging into the documentation yields the following table:

2.2 Reserved Codes

The following reserved code values are used:

Value     Definition
-1 INAPPLICABLE     Question was not asked due to skip pattern
-2 DETERMINED IN PREVIOUS ROUND     Question was not asked in round because there was no change in current main job since previous round
-7 REFUSED     Question was asked and respondent refused to answer question
-8 DK     Question was asked and respondent did not know answer
-9 NOT ASCERTAINED     Interviewer did not record the data
-10 HOURLY WAGE >= $76.96     Hourly wage was top-coded for confidentiality
-13 INITIAL WAGE IMPUTED     Hourly wage was previously imputed so an updated wage is not included in this file

This is helpful; it indicates that the most relevant levels for our analysis ought to be the following three:

  1. Positive diagnosis of high cholesterol
  2. Negative diagnosis of high cholesterol
  3. Inapplicable to respondent

and given the low population in levels below -1, we can either aggregate these data points with level==-1, or discard them entirely. In the following analysis, we choose to discard them.
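Discarding the sparsely populated reserved codes can be sketched like this (toy values; keeps yes/no/inapplicable and drops -7/-8/-9):

```python
import pandas as pd

# Toy column of categorical responses, using the dataset's coding:
# 1 = yes, 2 = no, -1 = inapplicable; -7/-8/-9 are reserved codes.
responses = pd.Series([1, 2, -1, -8, 2, -9, 1])

# Keep the informative levels and discard the reserved codes below -1.
kept = responses[responses >= -1]
```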

A final question to ask at this step: which level corresponds to a positive diagnosis of high cholesterol? A close look at the documentation indicates that level==1 corresponds to having the condition, and level==2 corresponds to not having the condition. Knowing which level is which doesn't particularly affect the computations that generate our models going forward, but it is important for interpreting them.

As a quick sanity check, let's get a visualization of the distribution of the data between these levels:

In [ ]:
data["HIGH_CHOLESTEROL_DIAGNOSIS_(>17)"].hist(bins=100, weights=data["FINAL_PERSON_WEIGHT_2013"])

Of the respondents for whom the question was applicable, it looks like the ratio of level==1 to level==2 is 7/17 or about 2/5. This is close to the CDC estimates of the prevalence of high cholesterol amongst American adults: http://www.cdc.gov/dhdsp/data_statistics/fact_sheets/fs_cholesterol.htm .

Next, instead of using level==1 or level==2 to describe the values that a categorical feature takes, I'd like to translate the levels of these categorical variables into their descriptions so my plots are easier to read.

Here's a dictionary that implements this using the documentation quoted above:

In [ ]:
interpretation={}

interpretation["categorical"]={
                1: "Yes",
                2: "No",
                -1: "Inapplicable",
                -2: "Determined in previous round",
                -7: "Refused",
                -8: "Don't know",
                -9: "Not ascertained",
                -10: "HOURLY WAGE >= $76.96",
                -13: "INITIAL WAGE IMPUTED"
                }

interpretation["RACE/ETHNICITY_(EDITED/IMPUTED)"]={1: "HISPANIC",
                        2: "NON-HISPANIC WHITE ONLY",
                        3: "NON-HISPANIC BLACK ONLY",
                        4: "NON-HISPANIC ASIAN ONLY",
                        5: "NON-HISPANIC OTHER RACE OR MULTIPLE RACE"
                        }

interpretation["SEX"]={1: "MALE", 2: "FEMALE"}

interpretation["MARITAL_STATUS-12/31/13_(EDITED/IMPUTED)"]={-9: "NOT ASCERTAINED",
                           -8: "DK",
                           -7: "REFUSED",
                           1: "MARRIED",
                           2: "WIDOWED",
                           3: "DIVORCED",
                           4: "SEPARATED",
                           5: "NEVER MARRIED",
                           6: "UNDER 16 - INAPPLICABLE"
                          }

interpretation["EDUCATION_RECODE_(EDITED)"]={-9: "NOT ASCERTAINED",
                             -8: "DK",
                             -7: "REFUSED",
                             -1: "INAPPLICABLE OR UNDER 5",
                             1: "LESS THAN/EQUAL TO 8TH GRADE",
                             2: "9 - 12TH GRADE, NO HS DIPLOMA OR GED",
                             13: "GED OR HS GRAD",
                             14: "BEYOND HS,COLLEGE(NO 4YR DEG),ASSOC DEG",
                             15: "4-YEAR COLLEGE DEGREE, BACHELOR'S DEGREE",
                             16: "MASTER'S, DOCTORATE, OR PROFESSIONAL DEG"
                             }

interpretation["INDUSTRY_GROUP_RD_3/1_CMJ"]={-9: "NOT ASCERTAINED",
                            -1: "INAPPLICABLE",
                            1: "NATURAL RESOURCES",
                            2: "MINING",
                            3: "CONSTRUCTION",
                            4: "MANUFACTURING",
                            5: "WHOLESALE AND RETAIL TRADE",
                            6: "TRANSPORTATION AND UTILITIES",
                            7: "INFORMATION",
                            8: "FINANCIAL ACTIVITIES",
                            9: "PROFESSIONAL AND BUSINESS SERVICES",
                            10: "EDUCATION, HEALTH, AND SOCIAL SERVICES",
                            11: "LEISURE AND HOSPITALITY",
                            12: "OTHER SERVICES",
                            13: "PUBLIC ADMINISTRATION",
                            14: "MILITARY",
                            15: "UNCLASSIFIABLE INDUSTRY"}

interpretation["OCCUPATION_GROUP_RD_3/1_CMJ"]={
    -9: "NOT ASCERTAINED",
    -1: "INAPPLICABLE",
    1: "MANAGEMENT, BUSINESS, AND FINANCIAL OPER",
    2: "PROFESSIONAL AND RELATED OCCUPATIONS", 
    3: "SERVICE OCCUPATIONS",
    4: "SALES AND RELATED OCCUPATIONS", 
    5: "OFFICE AND ADMINISTRATIVE SUPPORT",
    6: "FARMING, FISHING, AND FORESTRY",
    7: "CONSTRUCTION, EXTRACTION, AND MAINTENANC",
    8: "PRODUCTION, TRNSPORTATION, MATRL MOVING", 
    9: "MILITARY SPECIFIC OCCUPATIONS",
    11: "UNCLASSIFIABLE OCCUPATION"}

interpretation["CENSUS_REGION_AS_OF_12/31/13"]={-1: "Inapplicable",
                                                1:"Northeast",
                                                2:"Midwest",
                                                3:"South",
                                                4:"West"}

interpretation["EMPLOYMENT_STATUS_RD_3/1"]={-9: "NOT ASCERTAINED",
                                            -8: "DK",
                                            -7: "REFUSED",
                                            -1: "INAPPLICABLE",
                                            1: "EMPLOYED AT RD 3/1 INT DATE",
                                            2: "JOB TO RETURN TO AT RD 3/1 INT DATE",
                                            3: "JOB DURING RD 3/1 REF PERIOD",
                                            4: "NOT EMPLOYED DURING RD 3/1"}

interpretation["FULL_YEAR_INSURANCE_COVERAGE_STATUS_2013"]={
     1: "<65 ANY PRIVATE",
     2: "<65 PUBLIC ONLY",
     3: "<65 UNINSURED",
     4: "65+ EDITED MEDICARE ONLY",
     5: "65+ EDITED MEDICARE AND PRIVATE",
     6: "65+ EDITED MEDICARE AND OTH PUB ONLY",
     7: "65+ UNINSURED",
     8: "65+ NO MEDICARE AND ANY PUBLIC/PRIVATE"
    }

Let's write a function that implements this as necessary for a feature.

The features whose levels I need to interpret this way are all categorical; the documentation indicates that the continuous variables aren't generally coded (apart from top-coding to preserve privacy). Accordingly, it would help to automatically determine whether a given feature is categorical or continuous. I'll implement the following rough rule of thumb: if a feature has fewer than 20 distinct levels, I'll assume it's categorical; otherwise (or if its name marks it as an age or count feature) I'll assume it's continuous.

In [ ]:
def check_discrete(data,thevariable):
    # Treat a feature as continuous if it has 20+ distinct values, or if its
    # name marks it as an age or count feature (genuinely numeric even when
    # only a few distinct values show up in the sample).
    if len(data[thevariable].unique())>19 or "AGE_OF" in thevariable or "AGE_AS" in thevariable or "#" in thevariable:
        return "continuous"
    else:
        return "categorical"

Let's classify the features in the dataset:

In [ ]:
continuous=[term for term in data.columns if check_discrete(data,term)=="continuous"]
categorical=[term for term in data.columns if check_discrete(data,term)=="categorical"]

Finally, I'll write a function that interprets the levels of the features in any categorical variable:

In [ ]:
def interpret_vectorized(data,feature):
    # Use the feature-specific code table if we have one,
    if feature in interpretation:
        return data[feature].map(lambda x: interpretation[feature][x])
    # otherwise fall back to the generic categorical code table,
    elif feature in categorical:
        return data[feature].map(lambda x: interpretation["categorical"][x])
    # and for continuous features, just flag the negative reserved codes.
    else:
        return data[feature].map(lambda x: "inapplicable" if x<0 else x)

Here's an example of how it transforms the levels in a plot:

In [ ]:
thevar='HIGH_CHOLESTEROL_DIAGNOSIS_(>17)'
In [ ]:
def explore(data,thevar):
    # Interpret the levels, then sum the survey weights within each level
    # to get a population-scaled count.
    temp=interpret_vectorized(data,thevar)
    w=data["FINAL_PERSON_WEIGHT_2013"]
    temp=pd.concat([temp,w],axis=1)
    result=temp.groupby([thevar])["FINAL_PERSON_WEIGHT_2013"].sum()
    ax=result.plot(kind="bar",title=thevar)
    ax.set_ylabel("Count")
    plt.show()
In [ ]:
explore(data,thevar)

That visualization was a nice way of learning something about the data, and confirming a hunch we had about how to interpret it. In fact, simple visual plots of different features in the dataset are a crucial way of getting some initial insight into the data, and helping us prioritize the most fruitful approaches for modeling the data.

To fix ideas, let's scope down what we want to go into our model.

Explicitly, we'd like to be able to answer the following question: which (if any) of the following features about a person determine what they will pay for healthcare in a year?

  1. Demographic information:
    1. Age
    2. Sex
    3. Race/Ethnicity
    4. Marital Status
    5. Educational attainment
    6. Census region
  2. Income information (these are all top-coded in the dataset for confidentiality reasons)
  3. Idiosyncratic health status variables:
    1. High blood pressure, including multiple diagnoses
    2. Heart disease (including coronary heart disease, angina, myocardial infarction, and other unspecified heart disease)
    3. Stroke
    4. Emphysema
    5. Chronic Bronchitis
    6. High cholesterol, including the age of diagnosis
    7. Cancer
    8. Diabetes
    9. Joint pain
    10. Arthritis
    11. Asthma
    12. Attention Deficit Hyperactivity Disorder/Attention Deficit Disorder (ADHD/ADD)
  4. Adult Body Mass Index
  5. Health Insurance Variables
    1. Public Insurance Indicators
    2. Private Insurance Indicators
    3. Any Insurance Indicators
    4. HMO plan
  6. Utilization, expenditure and source of payments information

It's fair to say that this is a lot of information to process. One thing that stands out here is that there are a great number of features already recorded for us in this dataset; this is in sharp contrast to some projects I've done using text and/or social media data, in which the hard part of the modeling process is feature construction in some appropriate way.

Since there is such a rich variety of features, I imagine most of the feature engineering in this project will amount to bucketing variables and levels in appropriate ways to optimize explained variation, rather than constructing new features from scratch.
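To make that concrete, here's a minimal sketch of the kind of bucketing I have in mind, using pandas' pd.cut on a toy age column. The column name mirrors the dataset's, but the frame and the cut points are illustrative assumptions, not the final model's choices.

```python
import pandas as pd

# Toy frame standing in for the real MEPS data.
toy = pd.DataFrame({"AGE_AS_OF_12/31/13_(EDITED/IMPUTED)": [4, 17, 23, 45, 64, 71]})

# Bucket age into coarse bands; the edges here are illustrative only.
toy["AGE_BUCKET"] = pd.cut(
    toy["AGE_AS_OF_12/31/13_(EDITED/IMPUTED)"],
    bins=[0, 18, 35, 50, 65, 120],
    labels=["0-17", "18-34", "35-49", "50-64", "65+"],
)
print(toy["AGE_BUCKET"].tolist())  # ['0-17', '0-17', '18-34', '35-49', '50-64', '65+']
```

The same pattern (pd.cut, or pd.qcut for quantile-based bins) applies to any of the continuous features below.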

To this end, let's make a list of the exact features we want to start working with.

  1. AGE AS OF 12/31/13 (EDITED/IMPUTED)
  2. SEX
  3. RACE/ETHNICITY (EDITED/IMPUTED)
  4. MARITAL STATUS-12/31/13 (EDITED/IMPUTED)
  5. FAMILY'S TOTAL INCOME
  6. ADULT BODY MASS INDEX (>17) - RD 5/3
  7. HIGH BLOOD PRESSURE DIAG (>17)
  8. CORONARY HRT DISEASE DIAG (>17)
  9. AGE OF DIAGNOSIS-HIGH BLOOD PRESSURE
  10. MULT DIAG HIGH BLOOD PRESS (>17)
  11. AGE OF DIAGNOSIS-CORONARY HEART DISEASE
  12. ANGINA DIAGNOSIS (>17)
  13. AGE OF DIAGNOSIS-ANGINA
  14. HEART ATTACK (MI) DIAG (>17)
  15. AGE OF DIAGNOSIS-HEART ATTACK(MI)
  16. OTHER HEART DISEASE DIAG (>17)
  17. AGE OF DIAGNOSIS-OTHER HEART DISEASE
  18. STROKE DIAGNOSIS (>17)
  19. AGE OF DIAGNOSIS-STROKE
  20. EMPHYSEMA DIAGNOSIS (>17)
  21. AGE OF DIAGNOSIS-EMPHYSEMA
  22. CHRONC BRONCHITS LAST 12 MTHS (>17)-R3/1
  23. CHRONC BRONCHITS LAST 12 MTHS (>17)-R5/3
  24. HIGH CHOLESTEROL DIAGNOSIS (>17)
  25. AGE OF DIAGNOSIS-HIGH CHOLESTEROL
  26. CANCER DIAGNOSIS (>17)
  27. CANCER DIAGNOSED - BLADDER (>17)
  28. CANCER DIAGNOSED - BREAST (>17)
  29. CANCER DIAGNOSED - CERVIX (>17)
  30. CANCER DIAGNOSED - COLON (>17)
  31. CANCER DIAGNOSED - LUNG (>17)
  32. CANCER DIAGNOSED - LYMPHOMA (>17)
  33. CANCER DIAGNOSED - MELANOMA (>17)
  34. CANCER DIAGNOSED - OTHER (>17)
  35. CANCER DIAGNOSED - PROSTATE (>17)
  36. CANCER DIAGNOSED - SKIN-NONMELANO (>17)
  37. CANCER DIAGNOSED-SKIN-UNKNOWN TYPE (>17)
  38. CANCER DIAGNOSED - UTERUS (>17)
  39. DIABETES DIAGNOSIS (>17)
  40. AGE OF DIAGNOSIS-DIABETES
  41. JOINT PAIN LAST 12 MONTHS (>17) - RD 3/1
  42. JOINT PAIN LAST 12 MONTHS (>17) - RD 5/3
  43. ARTHRITIS DIAGNOSIS (>17)
  44. AGE OF DIAGNOSIS-ARTHRITIS
  45. ASTHMA DIAGNOSIS
  46. AGE OF DIAGNOSIS-ASTHMA
  47. DOES PERSON STILL HAVE ASTHMA-RD3/1
  48. DOES PERSON STILL HAVE ASTHMA - RD 5/3
  49. ASTHMA ATTACK LAST 12 MOS - RD3/1
  50. USED ACUTE PRES INHALER LAST 3 MOS-RD5/3
  51. USED>3ACUTE CN PRES INH LAST 3 MOS-RD5/3
  52. EVER USED PREV DAILY ASTHMA MEDS -RD 5/3
  53. NOW TAKE PREV DAILY ASTHMA MEDS - RD 5/3
  54. HAVE PEAK FLOW METER AT HOME - RD 5/3
  55. EVER USED PEAK FLOW METER - RD 5/3
  56. ADHDADD DIAGNOSIS (5-17)
  57. AGE OF DIAGNOSIS-ADHD/ADD
  58. PREGNANT DURING REF PERIOD - RD 3/1
  59. LIMITATION IN PHYSICAL FUNCTIONING-RD3/1
  60. EMPLOYMENT STATUS RD 3/1
  61. HAS MORE THAN ONE JOB RD 3/1 INT DATE
  62. SELF-EMPLOYED AT RD 3/1 CMJ
  63. CHOICE OF HEALTH PLANS AT RD 3/1 CMJ
  64. INDUSTRY GROUP RD 3/1 CMJ
  65. UNION STATUS AT RD 3/1 CMJ
  66. OCCUPATION GROUP RD 3/1 CMJ
  67. HEALTH INSUR HELD FROM RD 3/1 CMJ (ED)
  68. HEALTH INSUR OFFERED BY RD 3/1 CMJ (ED)
  69. EMPLOYER OFFERS HEALTH INS RD 3/1 CMJ
  70. FULL YEAR INSURANCE COVERAGE STATUS 2013
  71. ANYONE IN RU HAVE FSA - R3/1
  72. # WKS/MON WOUT HLTH INS PRV YR-PN 18 ONL
  73. PRIVATE INSURANCE ANY TIME IN R5/R3
  74. PUBLIC INS ANY TIME IN R5/R3
  75. INSURED ANY TIME IN R3/1
  76. ANY TIME COVERAGE BY STATE INS - R3/1
  77. # OFFICE-BASED PROVIDER VISITS 13
  78. # OUTPATIENT DEPT PROVIDER VISITS 13
  79. # EMERGENCY ROOM VISITS 13
  80. # NIGHTS IN HOSP FOR DISCHARGES 2013
  81. FINAL PERSON WEIGHT 2013
  82. TOTAL OFFICE-BASED EXP 13
  83. TOTAL OUTPATIENT PROVIDER EXP 13
  84. TOT HOSP IP FACILITY + DR EXP 13
  85. TOTAL ER FACILITY + DR EXP 13
  86. CENSUS REGION AS OF 12/31/13
  87. EDUCATION RECODE (EDITED)

It's going to be extremely interesting to see if these have explanatory power for the statistical variation in spending on healthcare!

Speaking of which, what are the variables we're trying to predict?

  1. TOTAL OFFICE-BASED EXP 13
  2. TOTAL OUTPATIENT PROVIDER EXP 13
  3. TOT HOSP IP FACILITY + DR EXP 13
  4. TOTAL ER FACILITY + DR EXP 13

The dataset breaks out healthcare expenses into these four location based categories: office, outpatient, inpatient and ER.

Another modeling choice that needs to be made is whether to try and predict total expenses, or predict each category of expense individually. In my Insight project, I opted to try and predict them individually, to see if I could shed some light on whether the service location affected prices charged. There's reason to think this may be the case: ER sticker shock is widely reported in the news, more so than office visits.
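For reference, the alternative - predicting total expenses - only requires the row-wise sum of the four category columns. A minimal sketch on a toy frame standing in for the real data (the dollar amounts are made up):

```python
import pandas as pd

# Toy frame standing in for the real dataset; dollar amounts are made up.
toy = pd.DataFrame({
    "TOTAL_OFFICE-BASED_EXP_13": [150, 0, 4200],
    "TOTAL_OUTPATIENT_PROVIDER_EXP_13": [0, 30, 800],
    "TOT_HOSP_IP_FACILITY_+_DR_EXP_13": [0, 0, 25000],
    "TOTAL_ER_FACILITY_+_DR_EXP_13": [0, 450, 1200],
})

expense_cols = list(toy.columns)
toy["TOTAL_EXP_13"] = toy[expense_cols].sum(axis=1)  # row-wise total
print(toy["TOTAL_EXP_13"].tolist())  # [150, 480, 31200]
```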

Finally, it's time to reflect a bit on what we've done here. We've isolated explanatory variables that we hope and expect should be predictive of how much a person is charged for healthcare services in four different categories. That's great, but there's a very large number of predictors: over 80. Since the ultimate goal is to build a usable interface to the model that people can use to get a sense of their expected outlays on healthcare, it makes sense to refine and reduce this list. But how?

One way to think about this is to consider what information people are likely to be comfortable sharing about themselves that could help to predict expenses. Another is to focus on what the most predictive variables are, and then ask users to populate these variables in order to get a prediction.

In my Insight project, I opted to balance both these concerns by ranking the explanatory variables in terms of statistical variation explained, cutting off at a reasonable threshold (somewhere in between 10 and 20 variables, corresponding to 10-20 questions asked of the user), and also culling variables that would be particularly burdensome to the user to have to report.
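One hedged way to implement that ranking for continuous features is to score each candidate by its univariate $R^2$ against the target (the squared Pearson correlation) and sort; categorical features would first need dummy coding. A sketch on synthetic data - the names and coefficients are illustrative, not taken from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
# Synthetic data: x1 drives y strongly, x2 weakly, x3 not at all.
toy = pd.DataFrame({"x1": rng.normal(size=n),
                    "x2": rng.normal(size=n),
                    "x3": rng.normal(size=n)})
toy["y"] = 3 * toy["x1"] + toy["x2"] + rng.normal(size=n)

# Univariate R^2 of each feature = squared Pearson correlation with y.
r2 = {c: np.corrcoef(toy[c], toy["y"])[0, 1] ** 2 for c in ["x1", "x2", "x3"]}
ranked = sorted(r2, key=r2.get, reverse=True)
print(ranked)  # x1 first, x3 last
```

Cutting this ranked list at a threshold gives exactly the 10-20 question shortlist described above.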

The next thing to think about is: now that we have explanatory variables and explained variables selected, how do we best go about building a model?

My next step is to examine the data feature by feature, to get a better sense of what it looks like and to pick up hunches for what may explain the variation. Along the way I'm sure I'll develop some more detailed hypotheses about healthcare spending, and discover that some of what I implicitly assume to be true about the healthcare market isn't actually so.

First, let's look at the explained variables:

In [ ]:
data["TOTAL_OFFICE-BASED_EXP_13"].describe()

This is interesting. It looks like up to half of respondents spend less than $200 for office-based services in a year. The maximum, though, is concerning: over a hundred thousand dollars? Let's take a closer look at that.

In [ ]:
data[data["TOTAL_OFFICE-BASED_EXP_13"]>=100000]

There are 4 people with office-based expenses of over $100k. I'd like to see whether these records are bad data, or whether they should be kept in the dataset. My prior is that they should be kept (the MEPS survey has been run for over 15 years by people who, I can only assume, know what they're doing). But it doesn't hurt to take a closer look.

Now, I don't want to look at all the extra features in the dataset beyond those that I've decided I care about, so let's go about trimming the dataframe down to contain just these features.

As a matter of good practice, let's write the current dataset to a file, in case we break something and need to come back to this checkpoint.

In [ ]:
data.to_pickle("/home/soumya/research/insight/insight_project/modeling/data/dsm_data_cleaned_columns.pkl")

I've manually written the explanatory variables to a text file located at "/home/soumya/research/insight/insight_project/modeling/code/dsm_exog_.txt". I'm going to read this into memory:

In [ ]:
variables=[]
with open("/home/soumya/research/insight/insight_project/modeling/code/dsm_exog_.txt","r") as f:
    lines=f.readlines()
    for line in lines:
        variables.append(line.strip().upper())
print(variables)

I'm going to make sure there are no duplicates in my list:

In [ ]:
print("There were %d variables to start."%len(variables))
# Remove duplicates while preserving order.
deduped=[]
for this in variables:
    if this not in deduped:
        deduped.append(this)
variables=deduped
print("There were %d variables at the end."%len(variables))

For future reference, let's write this dataset to disk so we don't have to do all this over again.

In [ ]:
data[variables].to_pickle("/home/soumya/research/insight/insight_project/modeling/data/dsm_data_scoped_variables.pkl")

Let's check this worked:

In [ ]:
data=pd.read_pickle("/home/soumya/research/insight/insight_project/modeling/data/dsm_data_scoped_variables.pkl")
In [ ]:
data.head()

Great! Now let's look at those four data points more closely. (N.B.: the name "data" now refers to the trimmed dataframe.)

In [ ]:
data[["#_OFFICE-BASED_PROVIDER_VISITS_13","TOTAL_OFFICE-BASED_EXP_13"]][data["TOTAL_OFFICE-BASED_EXP_13"]>=100000]

OK - looking at the feature called "# OFFICE-BASED PROVIDER VISITS 13", these people had 168, 164, 51 and 128 visits respectively. While this certainly seems anomalous, it's consistent with the high amounts billed for office-based expenses.

At this point it's unclear whether these points should be left in the dataset, or thrown out. On the one hand, one would hope that most people don't need over 160 visits to an office-based provider in a year. On the other hand, it's a well-documented fact that a small handful of very ill people account for the vast majority of healthcare costs in the U.S. every year (this is well known to third-party payors). This may make it important to keep in these outliers in the final model.
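That concentration claim is straightforward to check against the data itself; here's a hedged sketch of the computation - the weighted share of total spending accounted for by the top decile of spenders - on a toy frame standing in for the real one:

```python
import pandas as pd

# Toy frame standing in for the real data: one heavy spender, many light ones.
toy = pd.DataFrame({
    "TOTAL_OFFICE-BASED_EXP_13": [120000, 500, 200, 100, 0, 0, 0, 0, 0, 0],
    "FINAL_PERSON_WEIGHT_2013": [1.0] * 10,
})

# Weighted spending, and a mask for the top decile of spenders.
spend = toy["TOTAL_OFFICE-BASED_EXP_13"] * toy["FINAL_PERSON_WEIGHT_2013"]
top = toy["TOTAL_OFFICE-BASED_EXP_13"] >= toy["TOTAL_OFFICE-BASED_EXP_13"].quantile(0.9)
share = spend[top].sum() / spend.sum()
print(round(share, 3))  # 0.993
```

Running the same computation on the real frame would tell us how lopsided office-based spending actually is.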

I'd like to take a closer look at these points. What was the average charge per visit for these respondents?

In [ ]:
(data["TOTAL_OFFICE-BASED_EXP_13"]/data["#_OFFICE-BASED_PROVIDER_VISITS_13"])[data["TOTAL_OFFICE-BASED_EXP_13"]>=100000]

Between $\$650$ and $\$2700$. Presumably these were specialists, and expensive.

If these amounts are not fraudulent, I can imagine that it would be important to leave them in while modeling, to help account for the risk of large healthcare bills. Since there are only four respondents with such high levels of expenditure, let's take a closer look to see if we can find out what drove their spending.

In [ ]:
temp=data[variables][data["TOTAL_OFFICE-BASED_EXP_13"]>=100000]
temp

I'll take a look at these respondents one by one:

In [ ]:
for f in temp.columns:
    print(f,"\t",interpret_vectorized(temp,f).loc[2384])
    
#54 year old male, high blood pressure, low income, high bmi, emphysema, joint pain, limitations in physical functioning
#private insurance
In [ ]:
for f in temp.columns:
    print(f,"\t",interpret_vectorized(temp,f).loc[2834])
    
#67 year old female, relatively high income, heart disease, high blood pressure, angina, stroke, diabetes, arthritis, 
#on both medicare and private insurance, 
In [ ]:
for f in temp.columns:
    print(f,"\t",interpret_vectorized(temp,f).loc[15019])
    
# 10 year old male, moderately high family income, asthma, every other diagnosis is censored. This respondent may be very sick, but the survey censors
#diagnosis information for everyone under 18
In [ ]:
for f in temp.columns:
    print(f,"\t",interpret_vectorized(temp,f).loc[15659])
    
#68 year old female, moderate family income, high blood pressure, high cholesterol, diabetes, joint pain, arthritis, limitations in physical
#functioning

It looks like the outliers consist of respondents who are chronically unwell, and this may be driving heavy treatment regimens.

Given the evidence, I'm inclined to leave these respondents in the dataset for now. While they are a small proportion of the population, their cases may be indicative of the financial consequences of unmanaged illness. I'll have to make sure their inclusion is explicitly mentioned in my model writeup, since I am making a call on whether these points are corrupt data.

Next, let's look at the overall distribution of expenditure on office based services.

In [ ]:
ax=data["TOTAL_OFFICE-BASED_EXP_13"][data["TOTAL_OFFICE-BASED_EXP_13"]>0].hist(bins=100)
ax.set_title("OFFICE BASED EXPENSES")
plt.show()

The graph above is a histogram of billed charges in the dataset. It's difficult to read because of the scale. I'll try breaking it into a few pieces: one consisting of data points near zero (perhaps one with zero and one with strictly positive values), another consisting of points that are positive but not too large (say, less than $50k or so), and a final piece consisting of the extremely large positive values.

Let's try this:

In [ ]:
def zoom_hist(thevar, therange,bin_width=100):
    '''Weighted histogram of thevar over therange; wrapper around df.hist.'''
    temp=data[[thevar,"FINAL_PERSON_WEIGHT_2013"]]
    bins=range(therange[0],therange[1],bin_width)
    ax=temp[thevar].hist(bins=bins,range=therange, weights=temp["FINAL_PERSON_WEIGHT_2013"])
    ax.set_title("%s between $%s and $%s, bin width is $%s"%(thevar,therange[0],therange[1],bin_width))
    ax.set_xlabel("Charges in $")
    ax.set_ylabel("Count")

A note on methodology: the dataset consists of weighted points. Each respondent carries a weight (the "FINAL_PERSON_WEIGHT_2013" column) such that scaling out the sample pointwise by these weights gives a representative sampling of the U.S. population. Since these weights are provided, I've used them in my histograms to get a more accurate sense of how these features are distributed across the population.
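The same weights are useful for summary statistics, not just histograms. Here's a minimal sketch of a weighted mean via np.average on a toy two-row frame standing in for the real data:

```python
import numpy as np
import pandas as pd

# Toy frame: two respondents, one representing 3x as many people as the other.
toy = pd.DataFrame({
    "TOTAL_OFFICE-BASED_EXP_13": [100, 1000],
    "FINAL_PERSON_WEIGHT_2013": [3000, 1000],
})

unweighted = toy["TOTAL_OFFICE-BASED_EXP_13"].mean()
weighted = np.average(toy["TOTAL_OFFICE-BASED_EXP_13"],
                      weights=toy["FINAL_PERSON_WEIGHT_2013"])
print(unweighted, weighted)  # 550.0 325.0
```

The gap between the two numbers shows how much an unweighted summary can mislead when sampling is non-uniform.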

In [ ]:
thevar="TOTAL_OFFICE-BASED_EXP_13"
zoom_hist(thevar,therange=[0,1000],bin_width=10)

This graph clearly shows the extent of the zero clustering. It looks like over 80m people are estimated to have expenses of less than $\$10$ - this is about one-fourth of the population.

Next, let's take a look at the same range, but without the mass at 0.

In [ ]:
zoom_hist(thevar,therange=[1,1000],bin_width=100)

This shows that the modal nonzero expense for office services is about $\$100-\$200$, with about 32 million people falling into that bucket.

Let's look further out.

In [ ]:
def aggregate(feature,therange):
    # Weighted (population-scaled) count of respondents with `feature` in
    # the closed range [therange[0], therange[1]].
    temp=data["FINAL_PERSON_WEIGHT_2013"][(data[feature]>=therange[0]) & (data[feature]<=therange[1])]
    return round(temp.sum())

def print_aggregate(feature, therange):
    print("There are %s people in the range %s to %s"%(aggregate(feature,therange),therange[0],therange[1]))
In [ ]:
zoom_hist(thevar,therange=[1000,10000],bin_width=100)
In [ ]:
print_aggregate(thevar,[0,1000])
print_aggregate(thevar,[1000,3000])
print_aggregate(thevar,[5000,10000])

I've used a bin width of $\$100$ again, to keep things consistent. It looks like about 240m people spend less than $\$1000$ a year for office based services. Significant numbers of people (about 50m people) spend between $\$1000$ and $\$3000$ on office based services. After that, each $\$100$ bin contains relatively few people, but these are at fairly high levels of expenditure for an average person or family.

There is quite a bit of mass in the range $\$5000 - \$10,000$. These 8.5 million people - about $2.7\%$ of the population - are paying a lot for office based healthcare services. It would be interesting to know how much of the expenditure at this level is discretionary or non-essential, and how much is for relatively routine care - perhaps for badly managed chronic illnesses. In terms of informing policy, there's nothing wrong with the former; but chronic illness tends to affect a broad section of people, not all of whom can afford to pay up to $\$10,000$ a year for healthcare services, so there may be something to worry about if a lot of this high expenditure is non-discretionary.

Let's look a bit further:

In [ ]:
zoom_hist("TOTAL_OFFICE-BASED_EXP_13",therange=[10000,50000],bin_width=1000)
In [ ]:
print_aggregate(thevar,[10000,50000])

There is significant mass at these higher levels of expenditure - it's a small proportion of the population (4.5 million, or about $1.4\%$ of the population), but the severity of the costs is non-trivial. It will be very interesting to see if we can find out more about what characterizes people who tend to pay this much for healthcare in a year. Do they have terminal illnesses? Badly managed chronic illnesses? Severe accidents?

Let's look at the tail end of the distribution:

In [ ]:
zoom_hist("TOTAL_OFFICE-BASED_EXP_13",therange=[50000,250000],bin_width=2000)
In [ ]:
print_aggregate(thevar,[50000,250000])

There are about 1.8 million people in this range, and the severity of the costs they face is astonishingly high. Again, it will be extremely interesting to know what about these people drives this level of expenditure.

Next, let's take a look at the other three predicted variables, in the following order:

  1. Outpatient services
  2. Inpatient services
  3. Emergency Room services
In [ ]:
thevar="TOTAL_OUTPATIENT_PROVIDER_EXP_13"
In [ ]:
data[thevar].hist(bins=100)

Clearly, there's a lot of zero clustering going on here as well. Let's break it out into pieces like we did for office based services, and see what we find.

In [ ]:
therange=[0,1000]
zoom_hist(thevar,therange,bin_width=10)
print_aggregate(thevar,therange)

It looks like the vast majority of respondents pay less than $\$10$ for outpatient services.
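That zero and near-zero mass can also be quantified directly as the weighted share of exact zeros. A minimal sketch on a toy frame standing in for the data (the numbers are made up):

```python
import pandas as pd

# Toy stand-in for the real data: many zeros, a few positive charges.
toy = pd.DataFrame({
    "TOTAL_OUTPATIENT_PROVIDER_EXP_13": [0, 0, 0, 40, 250, 0],
    "FINAL_PERSON_WEIGHT_2013": [2000, 1500, 500, 1000, 500, 4500],
})

w = toy["FINAL_PERSON_WEIGHT_2013"]
zero_share = w[toy["TOTAL_OUTPATIENT_PROVIDER_EXP_13"] == 0].sum() / w.sum()
print(round(zero_share, 3))  # 0.85
```

Applied to the real frame, this would give the exact population fraction sitting at zero for each expense category.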

Let's see what happens if we drop the zero payments:

In [ ]:
therange=[1,1000]
zoom_hist(thevar,therange,bin_width=10)
print_aggregate(thevar,therange)
print_aggregate(thevar,[0,200])

On the order of 300m people incur charges of at most a couple of hundred dollars for outpatient services in the year.

Let's zoom in a bit more to that area and see if we can figure out what the modal charge is.

In [ ]:
therange=[1,200]
zoom_hist(thevar,therange,bin_width=10)

The modal charge seems to be $\$30-\$40$ for outpatient services.

Let's now look further to the right to see what the high expenditure cases look like:

In [ ]:
zoom_hist(thevar,therange=[1000,10000],bin_width=100)
In [ ]:
print_aggregate(thevar,therange=[1000,10000])
print_aggregate(thevar,therange=[3000,10000])

There are about 5.6m people in this range, which is about $1.75\%$ of the population. There are about 1.1m people, or $0.34\%$ of respondents that are paying over $\$3,000$ in this range.

Let's take a look at the rest of the data.

In [ ]:
zoom_hist(thevar,therange=[10000,50000],bin_width=1000)
In [ ]:
print_aggregate(thevar,therange=[10000,50000])

Not many people here - about 77,000. The maximum amount paid by any respondent is $\$26,703$, so we won't find mass any further to the right.

It looks like outpatient expenditure follows a similar pattern to office based expenditure, although the severity is less extreme.
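One compact way to compare the severity of two right-skewed distributions like these is to line up their upper quantiles. A sketch on toy (unweighted) series standing in for the two expense columns:

```python
import pandas as pd

# Toy stand-ins: both zero-heavy, but "office" has the heavier tail.
office = pd.Series([0, 0, 100, 300, 2000, 120000])
outpatient = pd.Series([0, 0, 30, 40, 500, 26000])

# Median and upper quantiles side by side.
comparison = pd.DataFrame({"office": office.quantile([0.5, 0.9, 0.99]),
                           "outpatient": outpatient.quantile([0.5, 0.9, 0.99])})
print(comparison)
```

On the real columns, a table like this would make the "similar pattern, smaller severity" claim precise.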

Let's turn to inpatient expenditure.

In [ ]:
thevar="TOT_HOSP_IP_FACILITY_+_DR_EXP_13"
In [ ]:
data[thevar].describe()

Alright, so we have a max of about $\$320,000$. That's extremely high!

Let's go through the same steps as before:

In [ ]:
zoom_hist(thevar,therange=[0,1000],bin_width=10)

Again, most people pay less than $\$10$.

What if we drop the mass at 0?

In [ ]:
zoom_hist(thevar,therange=[1,1000],bin_width=10)

Let's move out a bit further:

In [ ]:
zoom_hist(thevar,therange=[1000,10000],bin_width=100)

Ok, quite a lot of mass in this range. Charges in the thousands of dollars seem more likely in the context of inpatient services, since this usually includes procedures like surgeries.

Let's look out past $\$10,000$.

In [ ]:
zoom_hist(thevar,therange=[10000,50000],bin_width=1000)
In [ ]:
print_aggregate(thevar,therange=[20000,50000])

This is consistent with the kinds of amounts we're used to hearing about for surgeries and the like. About 4m people seem to have inpatient services costing over $\$20,000$; these are severely high amounts. Let's see if it gets worse for anyone.

In [ ]:
zoom_hist(thevar,therange=[50000,350000],bin_width=10000)
In [ ]:
print_aggregate(thevar,therange=[50000,350000])

For the 1.4m people in this range, the amounts charged exceed the median household's annual income in the United States. Anyone who falls into this range - and it looks like many people do - is at great financial risk.

I'd really like to know if the people who end up in this range have anything in common, or whether it's mostly random. Are these life-threatening accidents? Terminal illness? Badly managed chronic illness? Being able to tell the difference, if there is any, could be very informative for healthcare policy, and for advising healthcare consumers about their financial risk.

Let's move on to the final predicted variable, ER expenditure.

In [ ]:
thevar="TOTAL_ER_FACILITY_+_DR_EXP_13"
In [ ]:
data[thevar].describe()

The maximum charge for ER services is $\$63,179$. This is a lot lower than the maximum for inpatient services, but still very high.

Let's look at this close up.

In [ ]:
zoom_hist(thevar,therange=[0,1000],bin_width=10)

Fewer people at $\$0$ than for outpatient or inpatient expenditure, but more than for office based expenditure.

Let's drop the $\$0$ charges and see what it looks like.

In [ ]:
zoom_hist(thevar,therange=[1,1000],bin_width=10)

There are more people in this range than there were for inpatient services, though less than for office based services and outpatient services. A side-by-side comparison:

In [ ]:
def compare_hist(therange,bin_width):
    # Draw the four weighted expense histograms on a 2x2 grid for
    # side-by-side comparison over the same range and bins.
    bottom,top=therange
    weights=data["FINAL_PERSON_WEIGHT_2013"]
    panels=[("Office","TOTAL_OFFICE-BASED_EXP_13"),
            ("Outpatient","TOTAL_OUTPATIENT_PROVIDER_EXP_13"),
            ("Inpatient","TOT_HOSP_IP_FACILITY_+_DR_EXP_13"),
            ("ER","TOTAL_ER_FACILITY_+_DR_EXP_13")]
    for i,(title,col) in enumerate(panels,start=1):
        plt.subplot(2,2,i)
        ax=data[col].hist(range=therange,bins=range(bottom,top,bin_width), weights=weights)
        ax.set_title(title)
        plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
In [ ]:
compare_hist([1,1000],10)

While the overall utilization of ER services is lower than for office based services at each cost level, there seem to be many more people using ER services at the higher end of this range than either outpatient or inpatient services.

It's hard to say whether this means anything at this point, but it is interesting nonetheless.

Medical emergencies often come with significant sticker shock, so I'm curious to see what this looks like at much higher levels of expenditure.

In [ ]:
zoom_hist(thevar,therange=[1000,10000],bin_width=100)

Let's go a bit further...

In [ ]:
zoom_hist(thevar,therange=[10000,50000],bin_width=1000)

And a bit further ......

In [ ]:
zoom_hist(thevar,therange=[50000,70000],bin_width=1000)

Over 4000 people are estimated to see bills in excess of $\$60,000$ in a year.

I'd like to compare these ranges across categories for a moment.

In [ ]:
compare_hist([1000,10000],100)

Again, ER services are the second most frequently used at each cost level in the $\$1000 - \$10,000$ range.

Let's look at the $\$10,000 - \$50,000$ range.

In [ ]:
compare_hist([10000,50000],1000)

In the $\$10,000 - \$50,000$ range, ER services fall behind inpatient services, becoming the third most frequently used at each cost level. (This is an approximate statement; I'm not checking each bin.)

Finally, let's look at the tail end:

In [ ]:
compare_hist([50000,250000],1000)

It looks much the same in this range.

My main takeaway is that expenditure on these four location based services follows a reasonable pattern for most people. That is, the majority of people see charges for healthcare that are reasonably insurable on a plan with a low deductible, but that may bite for the average household on a plan with a relatively high deductible.

One important issue these plots raise is the following: there is a minority of respondents that incur charges for healthcare that are so high as to be clearly financially ruinous for the median household. Questions to ask include:

  1. Do the respondents who incur these extremely high charges have the financial means to bear them? Or are they likely to go bankrupt?
  2. What characterizes the respondents with extremely high healthcare costs? Do certain illnesses come into play? Do these charges come at random? Are some people at greater risk than others? Would it help anyone if they were advised that they were at elevated financial risk because of some characteristic they have? Should they have a particular health insurance plan? Can I use any of these insights to build a better planning tool?

These are important questions, and it would be fantastic if I could use my Insight project to shed some light on them, whether as part of the modeling process, or (even better) by integrating them into my webapp.

In the next post, I'll take a look at the explanatory variables.


Dec 21, 2015

Building the dataset

The next step in giving my project legs was to actually get data into my analysis stack. For this project, I planned on using a pandas/scikit-learn/statsmodels stack.

To this end, I needed to get a hold of the code dictionary available on this webpage, and use it to rewrite the 2013 MEPS dataset as a comma separated values file.

The first thing to do is to instantiate a scraper using the BeautifulSoup and requests libraries.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.pylab.rcParams['figure.figsize'] = (10, 6)

I may run this analysis on several MEPS datasets in the future, so I'm going to hardcode a number of pages into a dictionary for easy access later. Here's a function that accesses the dictionary using project names.

In [2]:
def theurl(name):
    answer={"2012":"http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H155", 
            "2013":"http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H163", 
            "population": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H157", 
            "medical": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H154", 
            "risk": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H140", 
            "employment": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H131", 
            "jobs": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H158", 
            "person_round_plan": "http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H153"}
    return answer[name]

So for instance, calling it on "2013" yields:

In [3]:
theurl("2013")
Out[3]:
'http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_codebook.jsp?PUFId=H163'

Let's write a function that returns the parse tree for the html we need:

In [4]:
def make_soup(url):
    response=requests.get(url)
    # specify a parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup=BeautifulSoup(response.content, "html.parser")
    return soup

And let's check that this worked for our page:

In [5]:
this=make_soup(theurl("2013"))
In [6]:
this.title
Out[6]:
<title>Medical Expenditure Panel Survey PUF Codebook</title>

I fiddled around with this object for a while and discovered that the data I needed all lay inside bits tagged "font".

In [7]:
this.findAll('font',limit=60)
Out[7]:
[<font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font color="#660000">::</font>,
 <font class="smallBlack" face="Arial">MEPS HC-163</font>,
 <font class="smallBlack" face="Arial">2013 FULL YEAR CONSOLIDATED DATA CODEBOOK</font>,
 <font class="smallBlack" face="Arial">DATE:   October 7, 2015</font>,
 <font class="smallBlack" face="Arial">940   </font>,
 <font class="smallBlack" face="Arial">941   </font>,
 <font class="smallBlack" face="Arial">PERS ELIGIBLE FOR ACCESS SUPPLEMENT-R4/2</font>,
 <font class="smallBlack" face="Arial">241   </font>,
 <font class="smallBlack" face="Arial">242   </font>,
 <font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R3/1</font>,
 <font class="smallBlack" face="Arial">243   </font>,
 <font class="smallBlack" face="Arial">244   </font>,
 <font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R4/2</font>,
 <font class="smallBlack" face="Arial">245   </font>,
 <font class="smallBlack" face="Arial">246   </font>,
 <font class="smallBlack" face="Arial">MILITARY FULL-TIME ACTIVE DUTY - R5/3</font>,
 <font class="smallBlack" face="Arial">473   </font>,
 <font class="smallBlack" face="Arial">474   </font>,
 <font class="smallBlack" face="Arial">ANY LIMITATION WORK/HOUSEWRK/SCHL-RD 3/1</font>,
 <font class="smallBlack" face="Arial">475   </font>,
 <font class="smallBlack" face="Arial">476   </font>,
 <font class="smallBlack" face="Arial">ANY LIMITATION WORK/HOUSEWRK/SCHL-RD 5/3</font>,
 <font class="smallBlack" face="Arial">734   </font>,
 <font class="smallBlack" face="Arial">735   </font>,
 <font class="smallBlack" face="Arial">SAQ 12 MOS: # VISITS TO MED OFF FOR CARE</font>,
 <font class="smallBlack" face="Arial">786   </font>,
 <font class="smallBlack" face="Arial">787   </font>,
 <font class="smallBlack" face="Arial">SAQ 4WKS: FELT CALM/PEACEFUL SF-12V2</font>,
 <font class="smallBlack" face="Arial">774   </font>,
 <font class="smallBlack" face="Arial">775   </font>,
 <font class="smallBlack" face="Arial">SAQ: HLTH LIMITS CLIMBING STAIRS SF-12V2</font>,
 <font class="smallBlack" face="Arial">834   </font>,
 <font class="smallBlack" face="Arial">835   </font>,
 <font class="smallBlack" face="Arial">SAQ: DATE COMPLETED - MONTH</font>,
 <font class="smallBlack" face="Arial">836   </font>,
 <font class="smallBlack" face="Arial">839   </font>,
 <font class="smallBlack" face="Arial">SAQ: DATE COMPLETED - YEAR</font>,
 <font class="smallBlack" face="Arial">772   </font>,
 <font class="smallBlack" face="Arial">773   </font>,
 <font class="smallBlack" face="Arial">SAQ: HLTH LIMITS MOD ACTIVITIES SF-12V2</font>,
 <font class="smallBlack" face="Arial">790   </font>,
 <font class="smallBlack" face="Arial">791   </font>,
 <font class="smallBlack" face="Arial">SAQ 4WKS: FELT DOWNHEARTED/DEPR SF-12V2</font>]

These tagged pieces contain the variable description, preceded by the start and stop columns for the position of the field in the fixed-width file.

For example:

<font class="smallBlack" face="Arial">353   </font>,
<font class="smallBlack" face="Arial">354   </font>,
<font class="smallBlack" face="Arial">CANCER DIAGNOSED - MELANOMA (>17)</font>,
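Since the entries arrive as a flat sequence of (start, end, description) triples, they can be regrouped in one pass. Here's a toy sketch (the second entry is invented for illustration, not taken from the codebook):

```python
# Regroup a flat codebook list into (start, end, description) triples.
flat = ["353", "354", "CANCER DIAGNOSED - MELANOMA (>17)",
        "355", "356", "CANCER DIAGNOSED - LEUKEMIA (>17)"]

# step through the list three items at a time
triples = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
print(triples)
```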

The next function extracts all the tags named "font" from the parse tree and collects them in a list that I (in very poor taste) called final. In my defense, the list is really nothing, and nothing is ever final, so there you go.

I then use a helper function called "prune_list" to strip out the crud I didn't want to deal with that also happened to carry a "font" tag. Finally, the function writes the code dictionary table to the file provided as its second argument.

In [8]:
def get_data(soup,target):
    # collect the contents of every font-tagged element
    result=[]
    for row in soup.find_all("font"):
        result.append(row.contents)
    final=[(term[0].replace(u'\xa0',u'')).replace(',','') for term in result]
    #get rid of junk
    theterms=["::",
              "MEPS HC-155",
              "MEPS HC-163",
              "MEPS HC-157",
              "MEPS H154 CODEBOOK",
              "MEPS H140 CODEBOOK",
              "MEPS H131 CODEBOOK",
              "MEPS HC-150",
              "MEPS HC-158",
              "MEPS H153 CODEBOOK",
              "2012 FULL YEAR CONSOLIDATED DATA CODEBOOK",
              "2013 FULL YEAR CONSOLIDATED DATA CODEBOOK",
              "2013 FULL YEAR POPULATION CHARACTERISTICS CODEBOOK",
              "2012 MEPS MEDICAL CONDITIONS FILE",
              "2002-2009 RISK ADJUSTMENT SCORES FILE",
              "EMPLOYMENT VARIABLE IMPUTATION FILE",
              "2012 JOBS FILE CODEBOOK",
              "2013 JOBS FILE CODEBOOK",
              "2012 PERSON ROUND PLAN FILE",
              "DATE:   August 25 2014",
              "DATE:   August 25 2015",
              "DATE: August 21 2014",
              "DATE: August 4 2014",
              "DATE:     March 6 2015",
              "DATE: December 15 2014",
              "DATE:    April 10 2013",
              "DATE:   August 12 2014",
              "DATE: February 13 2015"]
    final=prune_list(final,theterms)
    # the cleaned list comes in (start, end, description) triples
    with open(target,"w") as f:
        f.write("start,end,variable\n")
        for skip in range(0,len(final)-1,3):
            f.write("%s,"%final[skip])
            f.write("%s,"%final[skip+1])
            f.write("%s\n"%final[skip+2].strip())
    print("Done")
    temp=pd.read_csv(target)
    temp=temp.sort_values("start").copy()
    temp.index=list(range(len(temp)))
    return temp

Here's the definition of prune_list for reference:

In [9]:
def prune_list(thelist, theterms):
    # remove every occurrence of each unwanted term, in place
    for theterm in theterms:
        while True:
            try:
                thelist.remove(theterm)
            except ValueError:
                break
    return thelist
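As an aside, the same pruning can be expressed as a list comprehension, which sidesteps the repeated removals (at the cost of returning a new list rather than mutating in place). A toy sketch with made-up data:

```python
# Equivalent pruning via a comprehension; toy data, not MEPS output.
thelist = ["::", "940", "941", "MILITARY FULL-TIME ACTIVE DUTY - R3/1", "::"]
theterms = {"::"}  # a set makes membership tests O(1)

pruned = [term for term in thelist if term not in theterms]
print(pruned)
```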

Finally, here's the function that takes the MEPS dataset, and a dataframe containing the column dictionary, and writes the dataset out into a comma separated values file:

In [10]:
import csv

def write_table(data,header,target,short="No"):
    # short!="No" writes only the first 10 records, handy for quick tests
    if short!="No":
        data=data[:10]
    data=prune_list(data,[''])  # drop empty lines left over from the read
    header=header.sort_values("start").copy()
    header.index=list(range(len(header)))
    with open(target,"w") as f:
        for element in header["variable"]:
            f.write("%s, "%element)
        f.write("\n")
    with open(target,"a") as f:
        writer=csv.writer(f)
        for row in data:
            # slice each fixed-width field out of the record
            # (codebook columns are 1-indexed and end-inclusive)
            u=[row[(header["start"].iloc[place]-1):(header["end"].iloc[place])] for place in header.index]
            writer.writerow(u)
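To make the slicing step concrete, here's the same field extraction on a single toy record, with a hypothetical 3-field layout (the field names and column positions are invented, not the real MEPS ones):

```python
import pandas as pd

# Hypothetical codebook fragment: 1-indexed, end-inclusive column positions.
header = pd.DataFrame({"start": [1, 7, 9],
                       "end": [5, 7, 12],
                       "variable": ["id", "sex", "birth_year"]})

record = "00001 M 1975"
# Python slices are 0-indexed and end-exclusive, hence start-1 and end as-is.
fields = [record[header["start"].iloc[i] - 1:header["end"].iloc[i]]
          for i in header.index]
print(fields)  # ['00001', 'M', '1975']
```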

Now that we have the dataset written to disk in a format that I can use, in the next post I'll take a look at what the dataset contains and try to get a sense of the questions I might be able to answer with it.


The dataset

In my last post, I looked at several datasets that CMS provides, and worried that none of them were adequate to answer the question at hand: what drives healthcare spending in the United States?

What then jumpstarted my Insight project in its first week was a tip from one of the Insight NYC program directors, Friederike Schueuer, who suggested I look into the Medical Expenditure Panel Survey dataset.

This turned out to be something of a jackpot. The MEPS panel dataset has been assembled (under different names) since 1996, and surveys about 35,000 people a year. It collects detailed information on (amongst other things) their demographics, household composition, employment, income, socio-economic status, insurance status, health status (including chronic conditions) and how much they were charged for healthcare over the year in four different provider settings: office-based, outpatient, inpatient, and emergency room.

What turned out to be harder about working with this dataset was that the data is provided only as a flat file with no delimiters. The term for such files is "fixed-width": each record appears in its own row (terminated by a newline), and the entry for each variable occupies a fixed range of character columns. For example, the record ID might occupy exactly the first five characters of the record, the insurance status might be recorded in columns 101-105, and so on. The catch was that the only place I could find the correspondence between variables and column numbers was an html file on the web, entirely separate from the dataset itself. For example, this is the codebook for the 2013 dataset, and there were over 1700 variables in the dataset.
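For what it's worth, pandas can read fixed-width files directly once the column positions are known. Here's a minimal sketch using made-up column positions and field names (not the real MEPS layout):

```python
import io
import pandas as pd

# A tiny fixed-width "file": id in columns 1-5, sex in 7, birth year in 9-12.
raw = io.StringIO(
    "00001 M 1975\n"
    "00002 F 1982\n"
)

# read_fwf colspecs are (start, end) pairs, zero-indexed and end-exclusive.
colspecs = [(0, 5), (6, 7), (8, 12)]
df = pd.read_fwf(raw, colspecs=colspecs, names=["id", "sex", "birth_year"])
print(df)
```

Given a parsed codebook, the (start, end) pairs could be generated from its start/end columns instead of hardcoded.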

In the next post, I'll detail how I went about getting a hold of the code dictionary by using a web scraper, and then rewriting the MEPS dataset into a comma separated file so I could load it into an analysis tool like (Python) pandas.


Dec 20, 2015

Finding data to use

In this post, I'll outline how I made decisions about the data I chose to use to build the model behind my Insight project.

There were a few choices to be made, and a lot of constraints to consider. Something I came across in looking up data sources for this project was the contrast between the richness of the data currently available when it came to healthcare expenditure in the public (Medicare, Medicaid) sector, and how useful this data was for solving the problem I wanted to address.

To be more specific, there is a wealth of data available from CMS that sheds light on utilization and expenditure patterns for Medicare/Medicaid patients. However, it's hard to say to what extent patterns in this market are reflected in the private payor market - for patients with private individual plans or employer sponsored insurance. It's reasonable to be concerned that these two markets are sufficiently different in terms of market dynamics that what drives costs in the public payor market won't translate to the private payor market.

One dataset that caught my attention despite this issue was the Medicare Provider Utilization and Payment Data set. This is a highly detailed dataset reporting information that includes: NPI (National Provider ID), provider location information (physical address), provider specialty, provider Medicare participation, HCPCS (Healthcare Common Procedure Coding System) code and description, average and standard deviation of reimbursement under Medicare (including coinsurance, deductible and any third party payments) and the average and standard deviation of the submitted charge (what the provider actually bills). Medicare reimbursement rates are commonly held to track operating costs for providers, and so the ratio of submitted charges to Medicare reimbursement for each procedure code and in each location can be used as a rough proxy for the markups over cost seen at these facilities.
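The markup proxy described above is just a ratio per procedure code. A sketch of how it might be computed with pandas, on entirely made-up numbers (the column names here are hypothetical, not the actual CMS field names):

```python
import pandas as pd

# Hypothetical slice of the utilization/payment data: one row per HCPCS code.
df = pd.DataFrame({
    "hcpcs": ["99213", "99214", "71020"],
    "avg_submitted_charge": [150.0, 220.0, 95.0],   # what the provider bills
    "avg_medicare_payment": [72.0, 105.0, 30.0],    # what Medicare reimburses
})

# markup proxy: billed charge relative to Medicare reimbursement
df["markup_proxy"] = df["avg_submitted_charge"] / df["avg_medicare_payment"]
print(df[["hcpcs", "markup_proxy"]])
```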

Amongst other things, the markup proxy may be useful as a way to estimate billed charges for consumers, if the Medicare reimbursement rates for a specific condition are already known.

More interestingly, (spoiler alert!) one of the conclusions I drew at the end of my Insight project was that I expect the relative market power of the provider and the payor to be a strong determinant of what a healthcare consumer is billed. (It turned out that demographic, socioeconomic and health status information were useful, but nowhere near sufficient to explain the statistical variation in billed charges.) Given the results of my project, I've come to believe that considering market power needs to be a crucial part of the discussion on managing healthcare costs and improving healthcare outcomes.

This perspective is supported by a recent paper by Cooper, Craig, Gaynor and van Reenen from earlier this month. They show that the primary determinant for the variation in healthcare spending for privately insured consumers is providers' transaction prices; further, hospital prices are positively associated with indicators of hospital market power. They show that even after controlling for a multitude of demand and cost factors, prices at hospitals enjoying monopoly power in a hospital referral region are 15.3% higher than in markets with four or more hospitals.

Another exciting feature of this paper is their use of a novel dataset provided by the Healthcare Cost Institute (HCCI) containing claims data submitted by Aetna, Humana and United Healthcare. Why is this exciting?

The Kaiser Family Foundation has compiled data estimating that in 2014, Medicare covered 13% of Americans, while Medicaid covered 19% of Americans. Combined, this accounts for about a third of the U.S. population, which contrasts with 49% of Americans covered under an employer sponsored plan. While Medicare and Medicaid are likely to be the most powerful payors in the country (in terms of pricing power), private ESIs are likely to have much less pricing power relative to providers, while covering many more people. Therefore, one can imagine that understanding the private payor market will be crucial to understanding what drives healthcare expenditure in the U.S.

The Cooper-Craig-Gaynor-van Reenen paper has also been featured in the news, which may make for better reading if you're not in the mood for an academic paper.

For yet another perspective on the importance of market power in controlling healthcare costs, see Nicholas Bagley's recent post drawing attention to the obstructions strewn in the path of those seeking more and better data on the private payor market.

Given all this, I'm somewhat interested to see whether the markup proxies that can be obtained from the aforementioned Medicare Provider Utilization and Payment Data dataset can be sensibly used (perhaps after joining with another database with a more comprehensive list of providers in various referral regions) to estimate providers' pricing power.

It was unfortunately impossible to run this experiment within the short timeframe of an Insight project, but it's definitely something I'd like to investigate further in the coming months.

Well, I'm currently at a loss as to how to proceed. In the next post, I'll detail the dataset that I did discover to be useful, and what was helpful about it.


My Insight project

In the next set of posts, I'm going to introduce my Insight project.

Prompted by Jake Klamka's suggestion that we try to work on something that we feel would be impactful, I gave some thought during the first week of Insight to the question of the healthcare market in the United States.

This was far from being the first time that I'd had to think about this question. As an undergraduate, I remember reading a comparative analysis of healthcare expenditure in OECD countries, in which the point was made that the U.S. led the OECD in healthcare expenditure as a percentage of GDP, but was far from leading OECD countries in terms of allocation efficiency.

Since the ACA was passed, I've been quite interested to read economic policy papers and blogs covering the healthcare market. One that I've found quite interesting is The Incidental Economist. (This is not an endorsement of TIE bloggers' conclusions, views, or anything else.)

The primary market failure I've been interested in for a while is the lack of pricing transparency in healthcare. Anyone who's tried to get an estimate or quote from a hospital for a visit, test or procedure before said visit, test or procedure will be acutely aware of how frustrating this exercise is. Many people have found that asking for this information almost always nets you at best an evasive answer along the lines of "it depends on a million things, just come by and we'll send you a bill later", or at worst an outright refusal to answer the question. The uncertainty introduced by the combination of this behavior, and the tremendous and variable markups on healthcare services make for poor allocative efficiency, I think. In plain English, it's not good business.

Questions I wanted to answer, therefore, included: what drives prices in the healthcare market? Is there a way to predict ahead of time what a person can expect to be billed for healthcare services over a year? What explains the variation in billed charges for healthcare between different people and different providers? Is there a way to provide this information to consumers to enable them to make better choices when it comes to their provider, treatment or lifestyle?

My initial goal for my Insight project was a bit ambitious: I wanted to build an application for consumers to explore what their billed charges for healthcare would be over the next year. A natural application of these insights would be to help consumers compare health insurance plans.

Over the next few posts, I'll outline how I went about doing this.


Introducing Insight

In September of this year, I took a leave of absence to do the Insight Data Science program in New York City. Insight is a postdoctoral program that helps Fellows to transition from academia to doing data science in industry.

I can't say enough good things about the Insight program, but let me at least offer a few impressions.

The minimum requirements to enter include a Ph.D., and the program tends to accept applicants who are already well prepared to be data scientists (perhaps 90-95% of the way there), and polishes them for the last mile of the transition. The program is fairly selective; we were told that for the September session, Insight received over 900 applicants, and accepted about 70 people. While this is certainly not to say that people who don't make it into Insight aren't awesome, the selectiveness does ensure a talented cohort, and my fellow Fellows (this term being a running pun throughout the program) were one of the best things about the program.

Once in the program, Fellows take a week or so to decide on what they'll do for their Insight project, which is a data science project that they'll showcase to mentoring companies towards the end of the program. We're encouraged to try working on something that is do-able in a 2-3 week timeframe, but that is also impactful in some way. There is also the option to do one of a handful of consulting projects, which consist of well defined projects that real, existing startup companies need to get done, but don't have the manpower or funding to implement.

We then spend the next 2-3 weeks implementing our chosen projects in 1-week iterations, each culminating in an MVP (minimum viable product, as used in agile development and popularized by Eric Ries). Not all of our time is available for developing our projects; during this period we are visited by a lot of mentor companies that are interested in telling us about themselves, and what they do with data scientists. I think I slept an average of 4 hours a night during these weeks, and I definitely pulled a couple of all-nighters.

In the final phase of the program, we pivot to demo-ing at mentor companies, as well as prepping for the interview stack that most companies have for data scientists. We've been informed that the bar for admission to be a practicing data scientist in industry has risen recently, and that the interview loops have become more rigorous and demanding at a lot of companies. This makes a lot of sense: with competitor programs setting up shop, there's been an increase in the supply of candidates like us into the data scientist market. The rapid creation of other bootcamps has also increased the supply of data scientists with different, including non-Ph.D., backgrounds. Finally, the emergence of many university-based data science and analytics masters programs (here's one example close to home that I've helped to build) promises to channel yet more talent into the marketplace.

Challenge accepted.


Introducing this blog

I've set up this blog in order to encourage me to write on research topics that I'm interested in, and on projects that I'm currently doing.

I've found in the past that teaching has been a great way of learning new things well. This blog is a way to force me to reason through and explain what I'm working on in ways that are clear, detailed and that relate to concrete project goals.


Nov 28, 2015

Hello World!

This is my first post!

def assigns(m,c):
    # return a zero-argument constructor; calling it builds the line y = m*x + c
    def theline():
        return (lambda x: m*x+c)
    return theline

This function takes parameters \(m\) and \(c\) and returns a constructor for the linear function \(y = mx + c\).

Let's try this out:

newline_generator=assigns(1,1)
theline=newline_generator()
theline(1100)
1101

That's all she wrote.



© Copyright 2015 - 2016, Soumya D. Sanyal.